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ABSTRACT 

This  paper  examines  iterative  and  direct  methods  for  solving  Poisson's 
iquation  with  regard  to  their  adaptation  to  ILLIAC  IV.   SOR,  SLOP,  ADI,  and  FACR 
,re  programmed  in  GLYPNIR.   Detailed  suggestions  on  ASK  code  for  these  methods 
re  also  supplied. 

FACR,  Fourier  Analysis  and  Cyclic  Reduction,  is  the  fastest  method  on 
■ectangular  meshes.   SOR,  Successive  Over  Relaxation,  seems  to  be  the  most 
iromising  for  nonrectangular  meshes.   The  methods  are  between  thirty  and  forty - 
ive  times  faster  on  ILLIAC  IV  than  on  a  serial  machine  with  speed  equal  to  one 
)f  the  ILLIAC  IV  PEs  (Processing  Elements). 


TABLE  OF  CONTENTS 

Page 

1.  POISSON'S   EQUATION   IN  MATRIX  FORM 1 

2.  IMPLEMENTATION  OF   ITERATIVE  METHODS  ON   ILLIAC    IV 6 

2.1  Introduction  to   SLOR  and  SOR 6 

2.2  Introduction  to  ADI 9 

2.3  Implementation  of  SOR  on   ILLIAC   IV 11 

2.3.1  A  Parallel  Processor's   Effect   on  the  Algorithm.    ...  11 

2.3.2  Parallelisms   in  SOR 12 

2.3.3  SOR  Using  Straight   Storage l6 

2.3.1!     SOR  Using  Odd-Even  Storage 20 

2.3-5     SOR  Machine  Efficiency 23 

2.1+      Implementation   of  SLOR  on   ILLIAC    IV 2k 

2.U.1     Parallelism  in  SLOR 2k 

2.U.2     Storage  for  SLOR 25 

2.U.3      Improving  a  Line  by  the  Cuthill-Varga  Method 25 

2.k.k     SLOR   Implementation 28 

2.5  Implementation  of  ADI  on   ILLIAC   IV 31 

2.5.1  Thomas'    Method  on   ILLIAC   IV 32 

2.5.2  Skewed  Storage 33 

2.6  Summary  of  the  Results   for  Iterative  Methods 39 

3.  IMPLEMENTATION  OF  DIRECT  METHODS  ON    ILLIAC    IV    kk 

3.1  Introduction  to   Hockney's  Direct  Method   (FACR)    kk 

3.2  Odd/Even  Reduction   and  Odd  Column   Solution    kG 

3.3  CRED k9 

3.k     Fourier  Analysis  and  Synthesis  53 

3-5   Implementation  of  MFACR 60 


page 

3.6  Implementation  of  FACR 62 

3.7  Summary  of  the  Results  on  Direct  Methods 62 

k.      USE  OF  ILLIAC  DISK  FOR  NON-CORE  CONTAINED  MESHES 65 

k.l     ILLIAC  IV  I/O  System 65 

k.2     I/O  for  the  Bernard-Rayleigh  Convection  Problem ,  65 

5.   CONCLUSIONS 71 

REFERENCES 7  3 

APPENDIX  A.   TIMING  METHODOLOGY  A-l 

APPENDIX  B.   MAJOR  STEPS  OF  FACR  AND  MFACR  FOR  DIRICHLET'S 

BOUNDARY  CONDITIONS B-l 

APPENDIX  C.   SUCCESSIVE  (POINT)  OVER-RELAXATION  METHOD  IN  MSOR.  .  C-l 

APPENDIX  D.   SUCCESSIVE  LINE  OVER-RELAXATION  METHOD  IN 

GLYPNIR,  SLOR D-l 

APPENDIX  E.      PEACEMAN-RACHFORD  ADI    IN  GLYPNIR E-l 

APPENDIX  F.       HOCKNEY'S  DIRECT  METHOD,   MODIFIED,   MFACR    F-l 


LIST  OF  TABLES 

Page 

Table  1.   Parallelism  of  SOR 13 

Table  2.   Timing  of  an  Execution  of  Phase  1  SOR  using  Straight 

Storage 19 

Table   3.      Timing  of  an   Execution  of  Phase  1   SOR  using  Odd-Even 

Storage 22 

Table  k.      Timing  of  an  Execution  of  Phase  1   SOR 30 

Table  5-      Timing  of  an  Execution  of  Phase  1  ADI 38 

Table  6.      A  Numerical   Study  of  the  Methods   under  Discussion 39 

Table  7-      Summary  of  Time   and  Storage  Requirements   for  the   Iterative 

Methods  on   a  m  by   n  Mesh 1+0 

Table    8.      Comparison   of  Three   Methods   to   Solve   Tridiagonal  Matrix  Equa- 
tions  for  Dirichlet's  or  Neumann's  boundary   conditions    ...    52 

Table  9.  Preparation  of  the   Data  for  the  Fast   Fourier  Transform   ...  56 

Table  10.  The  Complex  Fast   Fourier  Transform 57 

Table  11.  Final   Step  in  Cooley's    Fourier  Analysis   and  Synthesis.    ...  58 

Table  12.  CRED 59 

Table  13.  A  Comparison  of  Methods  with  Respect  to  Time 72 

Table  lU.  Assignment  Statement  Notation A-2 

Table  15 .  Assumptions  in  Timing A-2 


LIST  OF  FIGURES 

Page 

Figure  1.      The  Mesh ■ 2 

Figure  2.      Straight   Storage  of  Row  j   of  U 17 

Figure   3.      Odd-Even  Storage  Data  Allocation  of  Row  j   of  U  Where 

m  is   Even 21 

Figure  h.      A  96  "by  96  Mesh  in  Skewed  Storage 34 

Figure  5.   Storage  Schemes  Used  on  Rows  k   and  5  of  a  96  by  96  Mesh 

if  Odd/Even  Reduction  is  Used 48 


Figure  6.   A  Disk  Map  of  512  by  512  Meshes 


TO 


1.   POISSON'S  EQUATION  IN  MATRIX  FORM 

We  will  consider  Poisson's  equation  with  Dirichlet  boundary  condi- 
tions defined  on  a  rectangular  region.   This  equation  can  be  written  as 

t(X,T)  =  ^M   +  ^%^  (1) 

dY2        ox2 

where  the  value  of  U  is  given  at  the  boundaries  X  =  0,  Y  =  0,  X  =  (m-l)h 


x 


and  Y  =  (n-l)h    We  denote  U((i-l)h  ,  (d-l)h  )  by  U     t('(i-l)h  ,  (j-l)h  ) 
by  \|/ .  .  and  error  of  order  p  by  0(p).   By  dividing  the  region  into  mesh 

points  such  that  the  distance  between  mesh  points  in  the  vertical  direction 

is  h  and  in  the  horizontal  direction  is  h  we  get  Figure  l.a.   The  region 
y  x 

in  Figure  l.a  can  be  rotated  90  clockwise  to  obtain  the  mesh  U  in  Figure 

l.b  where  U.  .  is  the  mesh  point  in  the  i —  row  and  the  ,i —  column. 
1*3 


By  use  of  Taylor's  Theorem  we  obtain: 

2  3  k     1 

h     >2tt  h     ^3TT  n     ^k„ 
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a)  The  rectangular  region  with  the  mesh  points. 
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b)  The  mesh  points  with  separation  of . boundary  and  interior  mesh  points, 


Figure  1.   The  Mesh 


82U   ,.     . 
Thus 


I   (i^}  =     "5   ^Ui    i  +  l    +  Ui     i    1    "   2Ui    i)    +   °(hv)  (2) 

<=  -u^      i,o+i       1,0-1         i,o  y 


5Y 


and    £-§(!,;))   .    i  (U  +  U  -  OTJ     )  +  O(^)  (3) 


85T  hx 

Substituting   (2)   and   (3)   into   (l)   we  obtain 

%      (U.     .    ,    +  U.     .    ,    -  2U.     .)   +  ■—   (U.    .     .  +  U.    n     •    -   2U.     -)  ss  \|r.     .  •      (if) 
h2      v    i,J+l  1,3-1  1,0  h2        1+1,0  1-1,3  i,J  1,0 

y  x 

This  is  the  equation  for  any  interior  mesh  point  U.  ..   The  values  of  the 

1,0 

mesh  points  on  the  boundary  are  given,  and  require  no  equations.   Let  us 

take  all  the  equations  for  the  interior  mesh  points  and  place  the  constant 

terms  on  the  right  hand  side  of  the  equations.   For  example,  the  equation 

for  tL  .  where  3  <  o  <  n-2  will  be 
2, J       J  _  ^  _ 


y  xx 


The  equation  for  Up  „  will  be 

P   (U2,3  -  2U2,2'   +  k   CU3,2   "  Z\2]   ~  *2,2   "  k  V  "  72     U2,l 

y  X  .  X  y 


Combining  the  constant  terms  into  one  term,  M.  .,  for  each  U.  .  we  obtain 

i>J  1,3 

;.  Mi,j  "  *i,J  "  h   (Ui,o+l  W  +  Ul,o-l  W5  "  h  1*1+1,3   W 

y  x 

« 

!  +Vi/i.i,i)  ^ 

i 

where   fi     _  {  0  if  k  +  i 
where   b^£     -      (  x  lf  k  „  ^  • 


Consider  the  internal  mesh  points  as  a  vector  U_  in  which- the  order  of 
the  mesh  points  is  as  follows: 

1)  The  first  interior  point  in  the  first  row  becomes  the  first 
element  of  the  vector. 

2)  All  interior  points  of  row  i  are  placed  in  the  vector  in  the 
same  order  they  appear  in  row  i. 

3)  The  (n-l) —  element  of  the  i —  row  is  followed  by  the  first 
interior  point  of  the  following  row,  where  2  <  i  <  m_2. 

The  transpose  of  U-j-  equals 


(U2,2>  U2,3'  '."'  U2,n-i;  U3,2'  U3,3'  '"'      \a-V  ' 


'     m-1,2  m-1,3' 


U 


m-l,n-l) 


Using  this  order  we  assemble  the  interior  mesh  point  equations  and  consider 
them  as  the  matrix  equation  AU  =  M  -where  M  is  the  vector  of  constant  terms 
and  A  is  a  matrix  made  up  of  the  coefficients  of  the  mesh  points. 


The  matrix  A  is  a  (m-2)(n_2)  by  (m-2)(n-2)  block  tridiagonal  matrix 
of  the  following  form: 


A  = 


B    C 

C    B    C 

C    B 


C    B 
G 


1 


(6) 


where  C  =  al 


h-2' 


+1 

2 


h 


X 


and  B  is  a  (n-2)  by  (n-2)  tridiagonal  matrix  of  the  form 


"^ 


d     e     d 

\       s 

\      \      \ 
\      \       \ 

\   \    v 
x    \   \ 

d      e     d 


(6.1) 


J 


■where  d  =  —  and  e  =  -  (2a  +  2d), 
h 

y 


There  are  two  general  approaches  used  in  solving  this  system  of  linear 
equations,  direct  methods  and  iterative  methods.*  In  this  study  we  have 
limited  our  discussion  to  three  popular  iterative  methods  and  one  direct 
method:   the  successive  over-relaxation  method  (SOR),  the  successive  line 
over-relaxation  method  (SLOR),  the  alternating  direction  implicit  method 
(ADI),  and  Hockney's  direct  method  (FACR).    In  the  discussions  of  FACR  three 
types  of  boundary  conditions  will  be  examined:   periodic,  Neumann  and  Dirichlet. 
j  The  discussion  of  iterative  methods  considers  only  Dirichlet  boundary  conditions , 


No  matter  which  method  is  used  in  solving  the  equations  the  lower  bound 
on  the  error  is  0  (max(h2,  h2)).   This  is  due  to  the  error  terms  in  (2) 


and  (3) 


y  x' 


2.   IMPLEMENTATION  OF  ITERATIVE  METHODS  ON  ILLIAC  IV 
2.1   Introduction  to  SLOR  and  SOR 

Iterative  methods  have  one  thing  in  common.  They  each  begin  with 
an  initial  guess  to  the  solution  and  improve  upon  the  guess  with  each 
iteration.   Thus  the  following  notation  is  useful:   The  value  of  UL  after 
the  &21     iteration  is  denoted  by  u|  ^  .  U^j  is  the  initial  guess  and 

if  the  method  converges  then  Lim  (U    -  U      )  =  0. 

Matrix  A  can  be  divided  into  three  matrices  such  that  A  =  D  -  E  -  F 
where  D  is  block  diagonal,  E  is  strictly  lower  triangular  and  F  is  strictly 
upper  triangular.   Thus  AU  =  M  can  be  written  as  DILj.  -  (E  +  P)  llj.  =  M. 

From  this  we  get  the  iterative  method: 

U^+l)  =  D"1  (E  +  F)  TjW      +   if1  M  (7) 

The  matrix  equation  can  also  be  grouped  to  obtain 

U(i+l)   =  (D-E)-1  FU^}  +  (D-E)"1  M  (8) 

The  asymptotic  convergence  rate  of   (8)   is  twice  as   fast   as    (7)    [Todd, 
Page   391 J  •      This   is   due  to  the   fact  that  the  matrix  has   Young's  property  A 

[Wachspress,    Page  102].       (8)    is   a  successive  method  while    (7)   is   a 

** 

simultaneous  method. 

By  defining  J  =  —  (D  -  u>E)    and  H  =  —  (wF  +   (l-w)D)    for  u^Ove  get 
to  to 

u  JU-j.   =   u)HU     +   oM  from  AU     =  M  =    (J-H)U   . 


*   Improvement  is  measured  by  a  reduction  in  the  error  vector. 

**  Successive  methods  use  new  information  as  soon  as  it  is  available 
while  simultaneous  methods  use  old  values  for  the  entire  iteration. 


We   can  write  wJU     =  wHU     +  wM  as 


(D-wE)U. 


(ihl)   =    (goF  +    (l-w)D)   \j[£)   + 


ojM 


(9) 


Note  that    (8)   and   (9)   are   equivalent   if  id  =  1.      When  0  <   w   <   2  and  A  is 
positive   definite,    (9)    converges.      If  1<  oo  s.2,   then    (9)    is   referred  to 
as   an  over-relaxation  method   [Forsythe  and  Wasow,   Page  26l ] .      With 

different   values   of  io,    the  convergence   rate  of   (9)    is   altered.      For  a 

2 
single   relaxation  parameter,    it    can  be   shown  that  to     =  ;    —       , 

b   1  +/l-p(3) 

where  3  =  D   (E+F),  the  Jacobi  matrix,  and  p(8)  is  the  spectral  radius 

of  3,  gives  the  fastest  rate  of  convergence  [Young,  Page  169 ] . 

There  are  two  common  ways  in  which  equations  (7  -  9)  are  implemented  • 
The  first  gives  point  iterative  methods  by  letting  D  equal  the  diagonal  of 
A.   Thus  (7)  becomes  Jacobi 's  method,  (8)  becomes  the  formula  for  succes- 
sive point  relaxation  iterative  method  (SR)  and  (9)  becomes  the  successive 
point  over  relaxation  iterative  method  (SOR). 

If  we  let 


D  = 


where  B  is  defined  in  (6.1)  then  we  will  get  line  iterative  methods. 
(7)  becomes  the  simultaneous  line  iterative  method,  (8)  becomes  the 


successive  line  relaxation  iterative  method  (SLR)  and  (9)  becomes  the 
successive  line  over-relaxation  method  (SLOR)  [Varga,  Page  199] . 

To  give  the  reader  a  better  understanding  of  the  way  these  methods 

work,  we  write  the  individual  equations  for  the  interior  mesh  points 

U.  .,  where  3  <  i  <  m-2  and  3  <  0  <  n-2  for  each  method.   First  we  de- 
1, 0 

fine  h,  v,  and  g  as  follows: 

^2  .2  u2    v2 

h  h  h     h 

h  =  oY     o     ,         v  =  p2 o  >  and  g  =  -— Z_  ' 

2(h2  +  h2)  2(h2  +  h2)  2(h2  +  h2) 

v  x    y  x  x    y  x    yy 


Jacobi  's  method  (simultaneous  point  iterative  method) 

„U+1)                /..(i)     TT(i)   v  .  ,TT(i)     TT(i)   V 

U:  .    =  v  (U:  .  ,  +  U.  .  , )  +  h  (U.  t  .  +  U.  t  .)  -  g  M. 

1,0       1,0+1   1,0-1  v  1+1,0   1-1,0      1,. 


Simultaneous   line  iterative  method 

TT(i+l)                    ,TTU+l)             tt(^+1)n       i.     /tt(^)  ttU)        N 

U.    .       =    v  (U:    .  i     +  U.    .   ' )+  h  (U:  : .  +  U.    '    .)  -     g  M. 

1,0                    1,0+1          1,0-1              1+1,0  i-l,0                 i,. 


Successive  point  relaxation  method  (SR) 


(10) 


Note  we  have  excluded  the  first  row,  last  row,  first  column,  and  the 
last  column  of  the  interior  mesh  points  to  avoid  boundary  conditions. 
These  points  will  be  discussed  in  another  section. 


Successive  line  relaxation  method  (SLR) 

(Ml),  .  T  (u(i+D    +    u^D,  +  h  (UU)     +  ^1),  .     M.  .  .  (li: 

i*3  1*3+1      1*3-1        1+1*0    1-1*3        1*3 


Successive  point  over-relaxation  (SOR) 

u(i+1)  .  ,»  v  (u^.  ,  +  a**1])  +  »  h  (u^   .    +  u";1?)  .,,«.   . 

1,3  1,3+1  1,3-1  1+1,3  1-1,3  1,3 

+    ^      "if]'  •  (12) 


Successive   line  over-relaxation   (SLOR) 

TT(i+l)  /TTU+l)        TT(i+lK  t,    /TT(^)  TT(<0+1)\ 

U.     .        =     v   (U.     .    '  +   U.     .    '     +  w   h     U.    '     .     +  U:    .     . )    -  w    g  M.     . 

i*3  i*3+l  1*3-1  1+1*3  1-1*3  i*3 

+  (1-00)   (u^)  _      v      (u^)      +  u^)     )).  (13) 

1*3  1*3+1  1*3-1 


2 . 2  Introduction  to  API 

Let  us  look  at  the  equation  we  are  solving  once  again, 

ivi^ji  +  i^M±  _  T  (  fl  (1) 

as  before  we  use  Taylor'  s   Theorem  and  we  get 

y 


and 


sfu 

2 
&X  h 

x 


(1'^72      (Ui-l,j  +    Ui+l,j    -   2Ui,^   +   ^  (3) 


10 


Thus  when  we  form  the  matrix  equation  AIL.  =  M  where  A  is  the  matrix  (6), 
we  are  considering  a  horizontal  part  of  the  equation  and  a  vertical  part 
of  the  equation.  We  could  divide  A  into  two  matrices  H  and  V  such  that 
H  +  V  =  A  where  V  is  the  matrix  containing  the  coefficients  of  the  vertical 
equations  (2).   Both  H  and  V  can  be  placed  in  tridiagonal  form  with  the  ap- 
propriate permutation  matrix. 

We  can  take  the  equation 

AU  =  HU  +  VU  =  M  and  obtain 

(H  -  ^jjDUj  =  M  -  (V  +  w^l)^  and 
(V  -  o^DUj  =  M  -  (H  +  0)^1)1^  • 

Thus  we  can  get  the  Peaceman-Rachford  AD I  scheme  [Varga,  Page  212]. 
uW^h.^-I^.^^^  (lU.l) 

Ux(i+1)  =  (V  -  u^l)-1  [M  -  (H  f  a)ivl)U^+1/2)]  (14.2) 

The  parameters,  u^  and  u^y  are  defined  in  [Wachspress,  Page  192]. 

It  will  help  to  have  the  corresponding  equations  for  the  individual 
mesh  points,  IL  .,  where  3  <  i  <  n-2  and  3  <  j  <  m-2. 

(^1/2)    _J _1    (i)      (J)  2_   (!) 

m     h2  y  y 

x 

_  _l  (u(ifl/2)   uU+1/2),,  ; 

h2  lui-l,j   +  ui+i,j  ".•  (15.D 

x 


ADI  requires  that  A  be  positive  definite.      This   can  be  obtained  by  using 
-AUj  =  "M  or  by  using  -u)£V  and  -co       instead  of  to  y  and  co      .     We  use  the 
latter  technique. 
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u(i+l)  =    -1 


[M  -J:   (u(i+1/2)   +    U(i+V2))    _    ,  _  _2,       (if  1/2) 

_2>       LMi,J        h2    lUi+l,j        +    Ui-l,J      >    "    ^WiV  9j    Ui,j 


(^y+-f)     X'J    h^    ^^       ^^  ^        hx 

y 

y 

Examining  (15-1)  and  (15.2)  we  note  a  strong  resemblance  to  simultaneous 
column  relaxation  followed  by  simultaneous  row  relaxation. 

2 . 3  Implementation  of  SOR  on  ILLIAC  IV 

2.3.1  A  Parallel  Processor's  Effect  on  the  Algorithm 

A  parallel  machine  allows  operations  to  be  performed  simultaneously 
which  are  done  in  separate  time  intervals  on  a  conventional  machine.   This 
is  referred  to  as  overlap.   The  amount  of  overlap  is  dependent  on  the  way 
in  which  the  algorithm  is  programmed.   The  SOR  algorithm  supplies  an  il- 
lustration of  how  the  flow  of  an  algorithm  is  changed  when  programmed  to 
maximize  overlap  on  a  parallel  machine. 

Let  us  examine  the  equation  for  SOR  closer: 

^^-^i^O^-^^^--   ^       (12) 

+  (1K0)  U{£\ 

for  3  <  i  <  m-2  and  3  <  j  <  n-2.   Note  the  interior  points  which  are  adja- 
cent to  one  or  more  boundary  points  are  not  included.   Also  note  that  for 

the  above  mesh  points  M   .  =  Y   ..   The  equation  for  U    would  be 

1  >3  1>J  2,2 


where 


M    =  W u    -     u 

2,2   X2,2   h2   Ul,2   ^2  U2,l 


x 


hy 


Thus 


M2,2  "  ■  *2)2  "  *   U2,l  "  h  V2  • 
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This  enables  us  to  rewrite  (l6)    as 


u 


17  - * v  (%3 +  $z ,]  +  u  h  <%k +  ui;s  '  - u  s  ¥2,2         ^ 


To  calculate  U    using  (17)  takes  two  more  additions  than  using  (.16) 
2,2 

but  M    is  needed  to  use  (l6).   On  a  serial  machine,  one  would  use 

equation  (16)  to  improve  U   .   The  choice  is  not  so  straightforward  on 

2>2 

a  parallel  machine  if  when  U    is  calculated  in  one  processing  element 

2>2 

(PE),  U.  .  is  calculated  in  another  PE  where  3  <  i  <m-2  and  (or) 

3  <  j  <  n-2.   In  this  case  to  calculate  U    using  (.16)  we  would  have  to 

2;  2 

turn  the  appropriate  PE  off  for  two  additions.  By  using  (.17),  we  woaild 

not  have  to  turn  off  PEs  for  those  additions  and  would  also  save  the 

calculation  of  M   .   A  similar  argument  can  be  given  for  all  points 
2,2 

adjacent  to  boundary  points.   Thus  the  choice  of  the  equation  t©  evaluate  a 
point  adjacent  to  the  boundary  in  one  PE  depends  on  what  is  being  done 
in  the  other  PEs. 


2.3.2  Parallelisms  in  SOR 

Examining  equation  12,  we  note  that  the  values  of  U.  -    .  and  U.  .  ; 

l-l, j      1,0-1 

are  needed  to  evaluate  U.  .   .   Now  consider  the  following  mesh: 

i,J 

0  0       0          0          0          0 

0  12          3          1+0 

0  2        3^50 

0  3        4          5          6          0 

0  k       5         6         7         0 

0  0       0          0          0          0 


Whenever  U.     .    is  a  boundary  point  U.    \   =  U.     .    ^for  all  £. 
x>3  i,J  i,J 
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0  indicates  a  boundary  point.   The  interior  mesh  points  are  divided  into 
seven  sets  (l,2,.3'»«>  7)  where  set  i  contains  all  points  labeled  by  i. 
To  improve  any  point  in  set  1,   we  must  first  improve  all  the  points  in 
the  union  of  all  sets  j  such  that  j  <  i. 

In  a  computer  with  k   PEs,  one  iteration  could  be  completed  in  7  time 
steps  where  a  time  step  is  the  amount  of  time  to  improve  a  mesh  point  in 
one  PE  by  improving  one  set  in  each  time  step.   But  the  computer  could  im- 
prove 28  points  in  7  time  steps.   To  improve  the  efficiency  of  the  method, 
we  look  at  what  could  be  done  in  each  time  step  assuming  we  had  done 
everything  possible  up  to  that  time  step. 


Time   Step 

1 

2 

3 

k 

5 

6 

7 

8 

9 

10 

11 

What 

r1 

21 

31 

k1 

51 

61 

71 

Can  Be 

I2 

2 
2 

32 

k2 

2 
5 

62 

72 

Done  At 

I3 

23 

33 

U3 

53 

63 

73 

That  Time 

1* 

2U 

3U 
I5 

25 

5U 
35 

x6i 

1 

Table  1.   Parallelism  of  SOR 

£  th 

i   denotes  the  i —  improvement  of  the  elements  in  set  i. 


In  Table  i  We  see  that  steps  1  to  5  improve  less  than  8  elements  but 
8  elements  are  improved  at  each  time  step  following  step  5.   Thus  if  we 


Ik 


have  a  computer  with  8  PEs  we  can  compute  U    in  5  +  2^  time  steps.   If 
we  assume  that  our  initial  guess  is  1  ,  2  ,'  3  ,  ^  ,  5  >   6  ,  7  ,  instead 
of  1,2,3,^,5,6,7   then  we  can  start  at  step  6  and  do  every 
iteration  in  2  time  steps. 


This  can  be  generalized  for  an  arbitrary  mesh  as  follows:   First  we 

group  all  the  interior  mesh  points  into  two  sets, 

ODD  =  {U.  .   I  i  +  j  is  odd},    EVEN  =  {U.  .  |  i  +j  is  even} 
i,  0  i,0 

Time  step  1  improves  all  the  points  in  EVEN.   Time  Step  2  improves 
all  the  points  in  ODD.  We  will  call  this  method  modified  successive  over- 
relaxation  (MSOR). 

Remember  all  the  points  in  ODD  can  be  calculated  using  only  points 
from  EVEN  and  all  the  points  in  EVEN  can  be  calculated  using  only  points 
from  ODD.   Thus  we  can  write  a  two  part  algorithm  for  MSOR  [Young,  Page  271]. 

U<i+1)  .„.,  (v  (4.1\    n  +  U<«  ,)  +  h  (S\    .   +  U^  .)  -  g  f.  .)        | 

1,0      Ei      1,0+1    1,0-1        1-1,0     1+1,0       1,0 


r(i) 

i,0 

for  all  the  points  in  EVEN 


+  (1^ei>  ui,j  (18.1; 


1,0  Di  v    1,0+1  1,0-1  1,0+1  1,0-1  1,0 


r(i) 

i,0 


+    (l^Di}    UL  (18.2; 


for  all  the  points   in  ODD. 
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th 
In  Young's  discussion  of  MSOR,  the  optimal  w  ,  and  j)  ,  for  the  £ 

iteration  are  calculated.   We  will  limit  our  study  to  MSOR  where  to  . 


Intuitively  one  would  think  MSOR  and  SOR  have  very  close  rates  of  con- 

at 

vergence.   In  fact,  their  "asymptotic  rates  of  convergence"  are  equivalent. 

If  A  is  a  convergent  n  x  n  complex  matrix,  for  all  £   sufficiently  large,  the 

£  £ 

average  rate  of  convergence  for  £   iterations,  R(A  ),  is  Lim  R(A  )  = 

£  -»  °° 
-lnp(A)  s   R   (A)  [Varga,  Page  67].   Thus  for  a  sufficiently  large  £,    the 

asymptotic  rate  of  convergence  equals  the  average  rate  of  convergence.   We 
know  if  P  is  a  permutation  matrix  that  p(PAP   )  =  p(A)  [  Birkhof f  and 
Maclane,  Page  2^9].   By  multiplying  (9)  "by  P  we  get 

puU+l)  =  p  £,_<£)-!  tcoF  +  (lMjJ)D]  p_1pu(i)  (19) 


The  asymptotic  rate  of  convergence  for  (19)  equals 
-Ln  p(p(DaJjE)"1  (wP  +(i^))D}p-1)  =  -Lnp(  (D^oE)"1  {wFf  (l-w)D))  (20) 

which  equals  the  asymptotic  rate  of  convergence  for  SOR. 

This  not  only  shows  that  SOR  and  MSOR  have  the  same  asymptotic  convergence 
rate,  but  that  SOR  applied  to  any  permutation  of  U  gives  the  same  asymptotic  rate 


The  negative  of  the  logarithm  of  the  spectral  radius  of  a  convergent  matrix 
A  is  the  asymptotic  rate  of  convergence,  R  (A),  for  the  matrix  A  [Varga, 
Page  67 1. 
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of  convergence.   Thus  we  can  choose  the  ordering  of  U  which  best  fits  the 
machine  being  used. 


2.3.3  SOR  using  Straight  Storage 

Now  we  shall  examine  the  implementation  of  SOR  on  ILLIAC  IV  [Rudsinski]. 
We  will  compare  two  storage  schemes  and  examine  their  effect  on  machine  effi- 
ciency  and  their  time  per  execution. 

The  first  storage  scheme,  Straight  Storage,  stores  all  the  elements  of 
row  i  before  any  of  the  elements  of  row  j  if  i  <  j .   The  elements  of  each 
row  appear  in  the  order  they  are  found  in  the  mesh.   The  first  element  of 
each  row  is  stored  in  PE  0  and  the  last  element  of  the  row  is  stored  in 
PE  I  where  I  equals  (n-l)  mod  6k,   n  is  the  number  of  columns.   Figure  2  should 
help  clarify  this  storage  scheme. 

How  can  we  implement  SOR  using  Straight  Storage?  We  observe  that  the 
PEs  which  contain  the  odd  elements  of  mesh  row  I  contain  the  even  elements 
of  mesh  row  I  +  1.   This  enables  us  to  improve  the  odd  elements  of  Rows  I  and 
I  +1  simultaneously  by  using  a  PE  integer  index  which  accesses  an  element 


* 

The  timing  method  is  explained  in  Appendix  A.   The  terms  used  in  timing 

are  defined  there  also. 


IT 


Row  U, 


Row  U 


3+1 


Row  U 


j+K-1 


Row  U 


J+K 


PEM  0 


V 


PEM   1 


PEM  I 


U. 


D,65 


u 


3,     n  -I 


U 


j+1,1 


• 

u. 

,2 

u. 
0 

,66 

• 

u. 

0 

,n+l- 

-I 

u. 

3- 

.1,2 

• 

• 

U. 

,1+1 

U. 
3 

,65+1 

• 

U. 
J 

>n 

u. 
J- 

•-1,1+1 

1 

PEM  63 

# 

UJ,64 

uo,i28  ; 

i 

! 

i 

U.    ,    c\ 

a+1,64 

• 

Figure  2.   Straight  Storage  of  Row  j  of  U 


in  row  I  (i+l)  in  the  even  (odd)  PEs  when  I  is  even.   In  a  like  manner  we 
can  access  all  the  even  elements  of  two  consecutive  rows.   If  we  have  an 
odd  number  of  rows  we  can  improve  the  odd  elements  of  the  last  row  and  the 
even  elements  of  the  first  row  simultaneously  using  a  similar  technique. 

Given  a  mesh  with  m  rows  and  n  columns,  we  use  the  index  described 
above  so  that  we  can  improve  two  rows  simultaneously.   Each  pair  of  rows  can 


K  stands  for  the  number  of  rows  of  PEM  required  to  store  n  words  and 
I  =  (n-1)  mod  6U. 
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be  divided  into  K  groups  where  K  =  Ln/6UJ   if  n  mod  6k   =   0  otherwise 
K  =  [n/6i+J  +  1.   Using  Phase  1  we  can  improve  6U  elements  at  a  time 

(n-2)/6UJ  times  and  then  use  Phase  2  to  improve  the  remaining  interior 
points.    If  we  use  Phase  1  to  improve  the  first  6U  odd  (even)  interior 
points  of  a  pair  of  lines  then  we  must  take  into  account  that  only  63  of 
them  are  found  in  the  same  pair  of  rows  of  PEM  because  of  the  boundary 
points.   This  means  we  must  use  special  indices  to  reach  the  6Uth  odd 
(even)  interior  point.   This  requires  no  extra  computer  time  because  the 
number  of  times  register  X  is  changed  is  not  affected.   SOR  requires  not 
only  the  point  to  be  improved  but  its  four  neighbors  and  the  appropriate 
value  of  Y. 

For  example:   To  improve  U.  .  we  need  U.    .,   U    .,   U.  ,  _,  U.  .  , , 

U     and  \|r 

These  values  will  be  accessed  by  a  combination  of  ACAR  and  Register  X 
indexing.   To  do  the  indexing,  K  will  be  combined  with  the  CU  integer  Cl-equal 
to  the  first  row  of  PEM  containing  mesh  points  to  be  improved  during  this 
execution  -  and  three  PE  integers. 

PI=(1,K,0,K,0,...,0,K),   PR=(K,0,K,0,...,K,0,)  and  PL=(K+1, 1,K,0,K,0, . . . ,K,0) 
if  we  are  improving  the  odd  points; 

PI=(K,1,0,K,0,K,...,K,0),  PR=(0,K,0,K,...,0,K)  and  PL=(l,K+l,0,K,0,K, . . . ,0,K) 
if  we  are  improving  the  even  points; 


*  [pj  equals  k  where  p  is  a  real  number,  k  is  an  integer,  and  k<  p  <  k-fl. 
**  Phase  1  and  Phase  2  are  defined  in  Appendix  D. 
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INITIAL  CONDITIONS 

Register 

Contents 

Register 

Contents 

$  X 

PI 

$  D3 

WI 

$  DO 

H 

$  CO 

CI 

$  Dl 

V 

$  CI 

CM 

$  D2 

W 

$  C2 

CP 

STEP 

ASSIGNMENT 

OPERATIONS  AND  OVERLAP 

TIME  IK 

NO. 

STATEMENTS 

CLOCKS 

1 

$A:  =  U[$X]  (0) 

PE  fetch 

7 

2 

$C3:  =  $D3;  $A:  =  $A*$C3 

CU  fetch 

Real  multiplication 

10 

3 

$S:  =  $A  -  \|f[$X]  (0) 

Real  subtraction 

(PE  fetch  overlapped  by- 

multiplication) 

7 

k 

$A:  =  U[$X]  (2) 

(PE  fetch  overlapped  by- 

addition) 

0 

5 

$A:  =  $A  +  U[$X]  (1) 

Real  addition  and  PE  fetch 

ik 

6 

$X:  =  PL 

(PE  fetch  overlapped  by  addition) 

0 

7 

$C3:  =  $D0;  $A:  =  $A*$C3 

CU  fetch 

Real  multiplication 

10 

8 

$R:  =  U[$X]  (0) 

(PE  fetch  overlapped  by 

multiplication) 

0 

9 

$A:  =  $A+  $S 

Real  addition 

7 

10 

$X:  =  PR 

(PE  fetch  overlapped  by  addition) 

0 

11 

$B:  =  RTL  (1,,$R) 

Route 

3 

12 

$R:  =  U[$X]  (0) 

(PE  fetch  partially  overlapped 

by  Route) 

1+ 

13 

$S:  =  $A 

Load  from  PE  Register 

1 

Ik 

$A:  =  RTR  (1,,$R) 

Route 

3 

15 

$A:  =  $A  +  $B 

Real  addition 

7 

16 

$C3:  =  $Dlj  $A:  =  $A*$C3 

(CU  fetch  overlapped  by  addition) 

Real  multiplication 

9 

17 

$X:  =  PI 

(PE  fetch  overlapped  by 

multiplication) 

0 

18 

$A:  =  $A  +  $S 

R'^al  addition 

7 

19 

$C3:  =  $D2;  $A:  =  $A*$C3 

(CU  fetch  overlapped  by  addition) 

Real  multiplication 

9 

20 

U[$X]  (0):  =  $A 

PE  store 

7 

TOTAL  TIME 

105 

Table  2.   Timing  of  an  Execution  of  Phase  1  SOR  using  Straight  Storage 
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Let  W  =  wg,  V  =  v/g,  H  =  h/g,  and  Wl  =  -~  where  g,  h,  and  v  are  defined 

in  1.2. 

Now  (IT)  can  be  written  in  GLYPNIR  as  follows: 

CP:  =  C  +  K; 

LOOP  CI:  =  C,  K  +  K,  CK  DO  BEGIN 

CM:    =  CI  -   K; 

CP:    =  CI  +  K; 

(21) 
U   [PI   +    CI]:    =  W*(V*(RTR(1,,U[PR+CT])+  RTL(l,,U[PL+Cl] ) )      +   H*(U[PI+CM] 

-  U[PI+CP])    -\|r[PI+Cl]    +     Wl*U[PI+Cl]    )', 

END; 

By  placing  CI  in  ACARO,,  CM  in  ACAR1,  CP  in  ACAR2,  H  in  $00,  V  in  $D1,  W 

in  $D2,  Wl  in  $D3  and  PI  in  $X  (21)  could  be  translated  into  ASK  as  outlined 

in  Table  2. 

2.3.^  SOR  using  Odd-Even  Storage 

Now  we  will  look  at  another  storage  scheme  to  see  if  we  can  improve 
the  data  access  time.   Given  an  m  by  n  mesh  divide  each  row  into  k  +  1  segments 
such  that  k  segments  are  128  mesh  points  long  and  the  last  segment  is  t   points 
long,  where  0  <  &   <  128.  Use  two  adjacent  lines  of  PEM  to  store  each  segment. 
Store  all  the  red  (black)  points    of  the  segment  in  the  first  (second)  line 
of  PEM.  The  segments  appear  in  the  same  relative  order  as  they  appeared  in 
Straight  Storage.   This  storage  scheme  will  be  called  Odd-Even  Storage. 
For  an  example  see  Figure  3. 


Appendix  A  explains  the  notation  and  timing  method  used  in  Table  h. 

A  point  that  appears  in  an  odd (even)  numbered  column  is  considered  a 
red  (black)  point. 
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If  H  <  6k,    then  Straight  Storage  requires  m  brj   lines  of  PEM  while  Odd- 
Even  Storage  required  m  (  h£-|  +l)  lines  of  PEM.   If  £   >  6k   then  both  Straight 


Storage  and  Odd-Even  Storage  require  m 


VI 


lines  of  PEM. 


PEM  0 


Row  U. 


Row  U 


J+l 


Row  U 


j+K-2 


Row  U. 

J+K-l 


h,2 


u 


j,n-.0+lj 


"^n-i+d 


PEM 

1 

!  : 

1  j 

,3 

V. 

,h 

1 
i 
i 
i 

! 
i 

lu. 

>n- 

-i+3 

U. 

,n- 

■£+k 

1 
i 

*            1 

PEM  I 


U 


j,n-l 


PEM  63 

• 

u. 

,127 

u. 

,128 

; 





1 

• 

Figure  3-   Odd-Even  Storage  Data  Allocation  of  Row  j  of  U  Where  n  is  Even. 


pi  =  k  where  k  is  an  integer  such  that  p  <  k  <  p  +  1. 
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INITIAL  CONDITIONS 

Register 

Contents 

Register 

Contents 

$  X 

PI 

$  D3 

Wl 

$  DO 

H 

$  CO 

CI 

$  Dl 

V 

$  CI 

CM 

$  D2 

W 

$  C2 

CP 

STEP 
NO. 

ASSIGNMENT 
STATEMENTS 

OPERATIONS  AND  OVERLAP 

TIME  IK 
CLOCKS 

1 
2 

$A:  =  U[$X]  (0) 
$C3:  =  $D3;  $A:  = 

=  $A*$C3 

PE  fetch 
CU  fetch 
Real  multiplication 

7 
10 

3 

$S:  =  $A  -  \|f[$X] 

(o) 

(PE  fetch  overlapped  with  mult.) 
Real  subtraction 

7 

h 

$C3:  =  $C0  +  1 

(CU  operations  overlapped) 

0 

5 

$R:  =  RTL  (1,,U(3)) 

(PE  fetch  overlapped  by  addition) 
Route 

3 

6 

$A:  =  $R  +  U[$X](3) 

(PE  fetch  partially  overlapped  by 
Route)  Real  addition 

11 

7 

$C3:  =  $D1;  $A:  = 

=  $A*$C3 

(CU  fetch  overlapped  by  addition) 
Real  multiplication 

9 

8 

$R:  =  U[$X](1) 

(PE  fetch  overlapped  by  multiplication' 

0 

9 

$S:  =  $A  +  $S 

Real  addition 

7 

10 

$A:  =  $U[$X]  (2) 

+  $R 

(PE  fetch  overlapped  by  addition) 
Real  addition 

7 

11 

$C3:  =  $D0;  $A:  = 

=  $A*$C3 

(CU  fetch  overlapped  by  addition) 
Real  multiplication 

9 

12 

$A:  =  $A  +  $S 

Real  addition 

7 

13 

$C3:  =  $D2;  $A:  = 

=  $A^$C3 

(CU  fetch  overlapped  by  addition) 
Real  multiplication 

9 

Ik 

U[$X]  (0):  =  $A 

PE  store 

7 

TOTAL  TIME 

93 

Table  3-   Timing  of  an  Execution  of  Phase  1  SOR  using  Odd-Even  Storage 
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Phase  1  of  the  implementation  of  SOR  using  Odd-Even  Storage  requires 
only  one  PE  index  and  one  route.   We  can  access  the  data  using  PI  =  (2,0,0,0. . ,0) 
and  CI  equal  the  first  row  of  PEM  containing  mesh  points  to  be  improved  during 
this  execution.   We  need  two  assignment  statements:   (22.1)  for  the  red 
points,  and  (22.2)  for  the  black  points. 

Using  K,  W,  V,  H  and  Wl  are  defined  in  2.2-3,  SOR  using  Odd-Even  Storage 
can  be  written  in  GLYPNIR  as  follows : 

LOOP  CI:  =  C,  K,  CK  DO  BEGIN 

CM:  =  CI  -  K; 

CP:  =  CI  +  K; 

U[PI  +  CI]:  =  W*(V*(R1R(1,,  U[CI+1])  +  U[PI  +  CI  +  l]) 
+  H*(U[PI  +  CM]  +  U[PI  +  CP])  -  V[PI  +  CI]  +  WI*U[PI  +  CI]);    (22.1) 

U[CT]:  =  W*(V*(RTL(1,,U[PI  +  CI  -  l])  +  U[CI  -  l]) 

+  H*(U[CM]  +  U[CP])-  V[Cl]  +  W1*V[CI]);  (22.2) 

END; 

Table  3  suggests  how  (22.1)  could  be  coded  in  ASK.   (22.2)  requires 
different  code  but  the  logic  is  similar.    The  code  for  SOR  using  Odd -Even 
Storage  is  10$  faster  than  the  code  for  SOR  using  Straight  Storage.   This  is 
due  to  a  savings  in  memory  fetching  and  the  elimination  of  one  route. 

2.3-5   SOR  Machine  Efficiency 

If  a  mesh  has  m  rows  and  32  {6k)    columns  and  straight  (odd-even)  storage 
is  used,  machine  efficiency  can  be  doubled  by  placing  rows  1  to  -  in  PE's 


0  to  31  and  rows  —  +  1  to  m  in  PE's  31  to  63.   This  improved  storage  scheme 
requires  that  rows  ^-  and  r-+  1  be  handled  as  special  cases.   The  idea 
behind  this  technique  can  be  used  on  any  mesh  which  doesn't  have  a  multiple 


2U 


of  6U    (128)    columns  when  straight    (odd-even)    storage   is  used.      Implementation 
of  such  techniques  increases  machine  efficiency  and  decreases   storage  require- 
ments per  mesh,   but  also  increases  the  complexity  of  the  program. 

2.U      Implementation  of  SLOR  on   ILLIAC   IV 
2.U.1     Parallelism  in  SLOR 

In  2.3.2  the   scheme   for  SOR  was   altered  to  maximize  parallelism  on 
ILLIAC   IV.      In  a  similar  manner,   we  will  modify  the  scheme   for  SLOR  to  maxi- 
mize the  number  of  columns  which   can  be  improved  simultaneously.      In  2.1*. 2 
the  need  for  blocks  of  128  columns  to  perform  column  relaxation  with  optimal 
efficiency  on  ILLIAC   IV  will  be   explained.      This  leaves  no   restrictions  on 
the  number  of  rows   other  than  the  limits  of  available   core.      Analogous  to 
the   SOR  scheme,   note  that   column   i   can  be   improved  for  the   second  time 
after  column   i  +  1   is   improved  for  the   first  time.      By  taking  advantage 
of  this   fact  we  can  improve  half  of  the   columns    simultaneously  after  a 
finite  number  of  iterations.      Thus   once  we  reach  that    state  we   can  improve 
the  entire  mesh  in  two   steps      if  we  have  enough  PEs.      In  SOR  we   improve  the 
odd   (even)   points   and  then  the  even    (odd)   points.      In  SLOR  we  must   improve 
the  odd    (even)    columns   and  then  the  even    (odd)    columns.      The  proof  we 
used  in  2.1.2  to   show  that  we   could  improve  the  points   in  SOR  in   any  order 
can  be  extended  to   show  that  the   order  in  which  the  columns   are   improved 
SLOR  does   not    change  the  asymptotic  rate  of  convergence. 

U(Ul)   =  h(U(^»      ♦  U<\+1>  )  ♦  0>W>        +  D«>      ).  ug  f .    .  (23.1) 

1»J  1+1»J  1-1  »J  1,J+1  l,j-l  l.J 

/  X      /      (*)  /      (M  (M  u 

+  (1  -  w   (u:  :   -  h (u  j  .  +  u  ;  j 
i»j         1+1, 3      1-1,  j 

for  the  odd  columns   and 


u: 


for  the   even  columns 


-  htu!**1!      ♦  US4!1'  )  ♦  «r  (u!Ui»   ♦  o!^)-  -8  *.    .  (23.2) 

1-1, j  i,j+i        i,j-l'  i,j 

(A)  ,    (i)  (I) 

:  :    - hu J  ,  +  u :  !  , 
1  j  1+1,  j      i--i».i 


(1  _  u>)  (u:  :    -  h(u: .;   ,  +  u:  1  J) 
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2.U.2     Storage   for  SLOR 

Now  we  need  to   find  a  storage  scheme  which   is   efficient   on  ILLIAC  IV. 
If  we  use  Straight   Storage,   we  run  into  the   same   inefficiency  we  did  with 
that   storage   in   implementing  SOR.      We  need  to  use  three  PE  integer   indiees 
and  do  two  routes  per  mesh  point.      Furthermore,   the  method  works  on  blocks 
of  256   columns.      Column  J  and  column  J  +  1   cannot  he   improved  simultaneously, 
Thus  we   improve  the  odd   (even)   columns   in  the   set    {J|I  <  J  <   I  +  63}   and 
the   even    (odd)   columns    in  the   set    {j|l  +  128  <   J  <,   I  +  191).      If  we  use 
Odd-Even  Storage  we  can  eliminate  one   route  and  two  of  the  PE  indices.      We 
also   decrease  the  blocks  to  128  columns   each. 

2.U.3     Improving  a  Line  by  the  Cut hill -Varga  Method 

The  method  of  implementing  SLOR  must    solve   a  tridiagonal  matrix 
equation   in  each  PE.      Thomas'   method   [Todd,    Page  395]    is   commonly  used. 
A  method  by  Cuthill   and  Varga  can  eliminate   some  of  the  work  by  doing  some 
preconditioning  if  the  tridiagonal  matrix  is   positive  definite   and  real 
symmetric.      Even   if  the  mesh  spacings   are  not   constant,  Poisson's   equation 
satisfies  those  conditions    [Cuthill      and  Varga,    Page  2l+l]. 

We  will   apply  the  latter  method,   the  Cuthill -Varga  method,    as   follows* 
Because  we   are  performing  successive   column  over-relaxation  we  will  use   a 
permutation  of  A,    PAP      ,   to   discuss  the  technique.      The   equation  under 
consideration  will  be  -PAP~  PU     =   -PM. 
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-PAP 


-1 


B,    C 

■■ 

C,    B,    C 

C,    B,    C 

XVs- 

\      "»      \ 

C,    B,    C 

C,    B, 

c 

c, 

B  - 

(2k) 


where 


B  = 


e,  d 


d,  e,  d 


v 


d,  e,  d 
d,  e 


C  =  — I,  e  =  +2  l^+  —■),  d  =  ~? 

h2  h2   h2       hx 

y       \  y   x/ 


.th 


B  and  C  are  m-2  by  m-2  matrices.   Let  U.  and  NL  denote  the  i   column 


of  PU  and  -PM  respectively.   Now  using  (2k)   we  obtain 


BU.  =  M.  -  C(U.  .  +-U.  _  ). 
l    i      l+l    l-l 


(25) 


By  assuming  the  left    side  of   (25)   is   constant  we  are  left  with   a  tridiagonal 
system  of  equations.   Because  B  satisfies  the   requirements   for  the   Cuthill- 
Varga  method,    [Cuthill   and  Varga,    Page  237] >   we  can   define   a  diagonal 
matrix 


G  = 


'm-2  J 


such  that 


G  -""BG-1   =  T'T, 
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where 


T  = 


1        t 


1 
1        t. 


X 


X 

1    I 


r3 


The   elements   of  G  and  T  are  calculated  as   follows: 

1 
d     2     2 


1 


c     =   e2;    c     =    {e   -    (- )    }         ,    2   <   j    <  m-2, 

3  j-1 


and 


(26) 


tj    =   CJ    Vl 


1   <  J    <  m-3. 


(27) 


By  multiplying 


BU.    =  M.    -   C(U.    .    +  U.    _  ) 
l  l  l+l  l-l 


by  G       we  obtain 


G  1BG~1GU.    =   G_1M.    -   G  1CG'1G(U.^n    +  U.   ,  )    . 
l  l  l+l  l-l 


Substituting  Y.    for  GU.    gives 

G^BG^Y.    =   G_1M.    -   G^CG"1    (Y.      +Y         ) 
i  i  l+l        i-1 

or 

T  TY.=      G'^M.    -   G_1CG_1    (Y  +    Y        ) 

l  l  l+l  l-l 


(28) 


Equation    (28)    is   first   solved  for  TY.    and  then   for  Y . . 

i  l 

The  Cuthill-Varga  method  employs   a  two   part   algorithm  to  perform  SLOR, 


First  Y.    is   calculated  usim 

l 


T  TY 


1  1+1  1-1 


(29.1) 


if  i   is  odd  and 


*     T     equals   the  transpose  of  T. 
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T'x?.(*+1)   =   G"V   G^CG-1    (t[H1]    +  Y<_\+1))  (29.2)  'j 

if  i   is   even. 

Then  y  is    combined  with  Y.        to  perform  the  over-relaxation, 

i  i 

(ML)  .„  jjU+l)  _,<*>]  +YU)      _  (30) 

i  i  i  i 

After  all  the   iterations   are   completed,   Y.    is   converted  to  U     using 

G_1Y.    =  U.. 
l  i 

2.U.U     SLOR  Implementation 

Nov  let  us   discuss  more  specifically  how  we   implement   SLOR     to 

solve  — ^  +  — —     =   y.      Odd-Even  Storage  issued  for  the  reasons    stated  in 

6X  6Y  ; 

Section  2.1*. 2.      We  use  the  Cuthill-Varga  method  and  thus   do  the  following 

preconditioning:    calculate   c,   1  £  i   <  m-2;   t.,   1   <   i  ^  m-3;   —  ,   1   <  i   <  m-2; 

i  i 

and  —  ,   1   <   i   <  m-2.      We  use  PEM  storage   for  these  values   so  that  when 

(h  c.)2 

y  i 

we  calculate   c.   we  can   do   6k  of  the   square  roots   simultaneously  because  the 

square  root  takes   a  great    deal  of  time  relative  to  other  operations   on 

**  1  —1 

ILLIAC  IV.   We  use  Grabone  '  to  access  the  c,  t.,  -r—  and  =— •   We 

1        x        i  f-u        \2 

x  (h   ci) 

subtract   the      row  boundary   conditions   times  —     to  the   appropriate  row 

_  hd 

~_L  x 

of  f   and  then  we  multiply       by  G      .      We  multiply  column  boundary  conditions 

by  G  to   obtain  Y     and  Y     so  that  they  may  be  used  in  equations    (29)    and   (30). 

Now  we  are  ready  to   do  the   iterations.***    Two  phases    are   used.      Phase  1 

improves   6U  columns   simultaneously  and  Phase  2   improves   the  remainder. 

Phase  1    is   used  when  the  number  of  interior  columns   is   greater  than  or 

equal  to  128.      Phase  1   starts   at  the  left   and  works  on  groups   of  128  interior  col  urn 


We  limit   our  program  to  handle  an   even  number  of  columns. 

The  GLYPNIR   function   is   used  to   access   a  single  PE  value  and  send  it 
to  all   PEs.      Refer  to   Table  9. 


### 


The   reader  should  be   familiar  with  the  material    in  Appendix  A. 
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moving  to  the  right  -until  the  number  of  remaining  interior  columns  is  less  than 

128.   Since  there  are  elements  of  63  interior  odd  columns  on  the  first 

line  of  PEM  containing  interior  points  we  use  PI  and  CI  as  defined  in 

2.3.^.  *-2  rows  of  temporary  storage,  STORE,  are  used.  We  use  two 

similar  codes  for  Phase  1:   one  for  the  odd  columns  and  one  for  the  even 

columns.   Phase  2  can  use  the  same  code  as  Phase  1  if  the  mode  is  changed 

at  the  appropriate  time. 

We  will  limit  our  discussion  to  the  code  for  the  odd  columns,  Phase 
1.   The  rest  of  the  code  can  be  found  in  Appendix  D. 

ST0RE[0]:=  t[PI+Cl]-GRABONE(DISB[0],0)*(RTR(l, ,U[CI+l] )+U[PI+CI+l] ); 

LOOP  CJ:=  1,1, CM3  DO  BEGIN  C:  =  CJ.[ 16:^2]; 

CI:    =  CI+K; 

(31) 
STORE   [CJ]:=  ^[PI+Cl]-GRABONE(DISB[C] ,CJ)*(RTR(l, ,U[CI+l]  )+ 

U[PI+CI+1])-GMB0NE(E[C],CJ)*ST0RE[CJ-1]; 

END; 


LOOP  MC:=1,1,CM3   DO  BEGIN 
C  J :  =CR3-MC  ; 
C:=CJ.[l6:If2]; 

ST0RE[CJ]:=ST0RE[CJ]-GRAB0NE(E[C],CJ)*ST0RE[CJ+1];  (32) 

END; 


*  -1  -1 

DISB  contains  —  ,  4*  contains  G  M  and  E  contains  t.. 
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trtfTTAL  ci6nT)TtTon's'  Fflfi  J?'fll^WARL)  elimination 


Register 
$A 


$X 


Contents 
STORE  I C J- 1 J 


PI 


STEP  NO. 


ASSIGNMENT  STATEMENTS 
GRAB0NE(EICJ,CU) 


$B 
$A 
$S 


=  $A*$B 

=  *  [$X](0)-J 


$R:  =  RTR(1„U(2)) 
$A:  =  U[$X](2)+$R 


$R 
$A 
$A 


=  GRABONE(DISB[C],CU) 
=  $A^$R 
=  $S-$A 


STORE (I):  =  $A 


Register 

fco 


$C1 
$C2 


Contents 


CI 
CJ 
CI+1 


OPERATIONS  AND  OVERLAP 


TIME  IN 
CLOCKS 


Load  10 

Real  multiplication  9 

(PE  fetch  overlapped  by  mult. )  j  7 

Real  subtraction 

(PE  fetch  overlapped  by  subtract.)   3 

Route 

(PE  fetch  partially  overlapped 

by  route) 

Real  addition  11 

Load  10 

Real  multiplication  9 

Real  subtraction  7 

PE  store  I  7 


Subtotal 


7T 


INITIAL  CONDITIONS  FOR  BACKWARD  ELIMINATION 


Register 


Contents 


ST0RELCJ+1J 


Register 


"fccT 


Contents 


-$A~ 


CJ 


STEP  NO.  !  ASSIGNMENT  STATEMENTS 


OPERATIONS  AND  OVERLAP 


[TIME  IN 
'CLOCKS 


$B 
$A 
$A 


=  GRABONE(E[C],CJ) 

=  $A*$B 

=  STORE (0)-$A 


STORE(O):  =  $A 


Load  10 

Real  multiplication  !  9 

i 
(PE  fetch  overlapped  by  mult. )     j  7 

Real  subtraction ■ 

PE  store  7 


Subtotal 
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INITIAL  CONDITIONS  FOR  OVER -RELAXATION 


Register 


Contents 


Register 


Contents 


$X 

$C0 


PI 

CI 


$C1 

$C2 


CU 
W 


STEP  NO. 


ASSIGNMENT  STATEMENTS 


$R: 
$A: 

$A: 


UI$XJ(0) 
STORE (l)-$R 

$C2*$A 


OPERATIONS  AND   OVERLAP 


,TIME   IN 
CLOCKS 


$A:    =  $A+$R 
U[$X](0):    =  $A 


PE   fetch 

PE   fetch 

Real  subtraction 

(CU  fetch  overlapped  by   sub. ) 

Real  multiplication 

Real  addition 

PE    store 


Ik 

9 

7 


Subtotal 


Th~ 


TOTAL  TIME 


150 


Table  k.      Timing  of  an  Execution  of  Phase   1     SLOR 
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LOOP  CJ:=0,1,CM3   DO  BEGIN 

ST:=U[PI+Cl]; 

Tj[PI+Cl]:=W*(STORE[CJ]-U[PI+Cl])+U[PI+Cl]; 

IF  ABS(ST-U[PI+CI])    GRT  BOUND  THEN  B:  =1;  %  CHECKING  THE  ERROR  BOUND   (33) 

CI : =CI+K; 

END; 

SLOR  has  three  primary  sections:   (31)  does  the  forward  elimination,  (32) 
does  the  backward  elimination,  and  (33)  performs  the  over-relaxation. 
In  Table  k-   we  suggest  how  to  code  these  sections  efficiently  in  ASK  and 
we  give  a  time  estimate  for  that  code. 

2.5   Implementation  of  API  on  ILLIAC  IV 

ADI  improves  all  columns  simultaneously  and  then  improves  all  rows 
simultaneously.   Thus, when  we  do  the  column  (row)  improvement  we  want 
the  values  of  the  boundary  columns  (rows), but  the  boundary  rows  (columns) 
could  have  been  added  to  vector  t  in  preconditioning.   One  solution  is  to 
include  both  the  row  and  the  column  boundary  values  in  each  iteration. 
Another  solution  is  to  add  them  both  in  when  preconditioning  and  then  set 
the  boundary  values  to  zero  so  they  will  not  affect  the  interior  points 
when  used  in  the  iteration.   If  we  need  the  values  of  the  boundary  after 
the  problem  has  converged  then  we  must  store  them  before  performing  any 
of  the  iterations  and  replace  them  after  all  the  iterations  are  completed. 
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2.5.1  Thomas'  Method  on  ILLIAC  IV 

As  in  SLOR  we  have  sets  of  tridiagonal  matrices  to  solve.  We  used 
the  Cuthill-Varga  method  in  SLOR.   This  method  does  preconditioning  to 
cut  down  on  the  computational  time  (see  Section  2.U.3).   ADI  changes  the 
diagonal  at  every  iteration.   Thus  the  preconditioning  would  have  to  be 
done  at  each  interation.   Thomas'  Method  [Todd,  page  395]  has  the.  same 
problem  but  the  preconditioning  requires  less  computer  time  and  the  total 
computer  time  per  iteration  is  smaller.   Thus  we  will  use  Thomas'   Method 
in  solving  the  tridiagonal  matrix  problems.   Instead  of  implementing  the 
method  in  a  straight  forward  manner,  we  use  the  fact  that  the  horizontal 
coefficients  and  the  vertical  coefficients  are  constant. 

The  matrix  problem*  we  want  to  solve  is  of  the  form 
aTl  +  b  T2  =  °1 


bT 


±_±  +   aT.  +  bT.+1  =  D.   (1  =  2,  3,-...,  n-3), 


bT   5+aT  Q=D  0  • 

n-3     n-2    n-2 

where 

,2       .        1 

a v~2~  +  W£v  >   "b  =  —   for  row  relaxation 

y  h 

and  ,2  «  1 

a.  -  -[   2  +  w^j,        b  =  —       for  column  relaxation 

x  hx 

Thomas'   method  is   as  follows 

el=h   \  =  a-bV    n  (i  =2,   3/.,.,   n-3),  (3k) 

l-l 

Dx  D.        bq. 

ql=—>     qi  =  a-b"e     X"  (i  =2,   3,    ...,   n-2),  (35) 

i-1 

Tn-2  =  V2>   Ti  =  qi  -  ei  Ti+i     (i  -  n"3,    *4,    -..,    I)-  (36) 


N 


We  are   considering  an m    by  n    mesh,   U, 
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If  we  let  c  =  -  we  can  rewrite  the  equation  (35)  as  follows: 


b 


Lx  =  e]_  c  Dv      q.  =  e.  (c  D.  -  q.^)  (i  =  2,3,  ...,   n-2), 


and  e    =  ___J? .   This  reduces  the  number  of  divisions  which  are  re- 

3n-3 


n-2  "  a-b  e 


latively  time  consuming  on  ILLIAC  IV. 

We  have  to  calculate  the  e.'s  and  store  them.   For  each  row  (column), 
they  are  the  same.   Calculating  the  e.'s  is  a  serial  process.   The  only 
way  we  could  overlap  is  to  calculate  the  set  of  e.'s  for  different  itera- 
tions simultaneously.   The  problem,  with  this  solution  is  that  it  requires 
extra  storage.   If  we  calculate  all  the  e.'s  needed  for  32  iterations  we 
use  max  (m-2,  n-2), rows  of  storage.   If  storage  is  available,  then  this 
is  a  feasible  method;  but  if  we  must  use  disk  I/O,  the  computational 
time  saved  "will  be  overcome  by  increased  I/O  time. 

2.5.2  Skewed  Storage 

The  next  problem  we  must  overcome  is  how  to  access  the  rows  on 
one  sweep  and  then  the  columns  on  the  next.   Skewed  storage  was  designed 
just  for  that  purpose.  We  can  place  a  m  by  n  mesh  in  skewed  Storage  by 
performing  the  following  algorithm: 

1)  Calculate  the  number  of  rows  of  PEM  required  to  store 
n  points.   Set  K  equal  to  that  number. 

2)  Place  the  m  by  n  mesh  in  straight  storage. 

3)  Route  each  of  the  K  rows  of  PEM  containing  elements  of 
mesh  row  i.   Let  the  distance  of  the  route  be  i-1. 
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Figure  4 .   A  96  by  96  Mesh  in  Skewed  Storage 
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AD  I  is  a  simultaneous  method,  (i.e.  only  values  from  the  previous 
iteration  are  used,   see  (l5.l)  and  (15.2)).   On  ILLIAC  IV  ve  can  simul- 
taneously improve  up  to  6k   rows  (columns).   Thus  if  the  dimensions  of 
the  mesh  are  less  than  67  we  can  do  all  the  rows  (columns)  in  one  time 
step.   If  one  of  the  dimensions  is  greater  than  66  then  we  need  to  use 
more  than  one  time  step  for  row  or  column  relaxation.   In  this  case  we 
will  use  two  phases  to  complete  an  iteration.   Anm  by  n  mesh  will  be 


divided  using  R 


m-2 


and  J  =  j-rj—  •   Phase  1  row  (column)  relaxation 

will  improve  R  (j)  groups  of  6k   consecutive  rows  (columns).   Then  Phase  2 
will  improve  the  remaining  interior  rows  (columns).   Phase  1  starts  with 

row  (column)  2  and  works  toward  row  (column)  m-1  (n  -l)-   When  Phase  1  im- 

th 
proves  the  I   group  where  1  <  I  <  R  (j)  and  k  is  the  first  row  (column) 

of  group  I,  the  old  values  of  row  (column)  k-1  are  needed.   Thus  to  im- 
prove group  I  Phase  1  goes  through  the  following  steps: 

1)  The  values  of  row  (column)  k-1  are  placed  in  the 
temporary  storage ,   NEW. 

2)  The  values  in  the  temporary  storage,  OLD  are  placed 
in  the  PEM  storage  for  row  (column)  k-1. 

3)  The  values  or  row  (column)  k+63  are  placed  in  the 
temporary  storage,  OLD. 

k)     Rows  (columns)  k  through  k+63  are  improved. 

5)  The  values  in  NEW  are  placed  in  row  (column)  k-1. 

When  Phase  1  is  improving  the  first  group  steps  1,  2  and  5  can  be 
omitted.   Phase  2  needs  to  do  steps  1,  2,   k   and  5. 
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Now  we  will  discuss  the  type  of  indexing  used  in  accessing  the  mesh  points 
required  to  perform  Phase  I  column  relaxation  using   skewed  storage.   We  ac- 
cess the  set  {(\|r.  .,  U.  .  )|k  <  j  <  k  +  63}   by  using  a  permutation  of 
1  j  3    1, 3 

the  PE  integer,  (1,0,0,  •••,0).   PI  will  stand  for  that  permutation. 

PI  will  be  used  as  a  PE  index  and  CI  will  be  the  CU  index.   PI  must  be 

routed  one  PE  to  the  right  to  access  the  set  {U.  ,  . |k  <  j  <  k+63}.   PI 

l+J-j  3 

must  be  routed  one  PE  to  the  left  to  access  [U.    . |k  <  j  <  k+63}  •   To 

l-  1,3 

access  the  set  {U.  .   |k  <  j  <  k+63}  we  must  add  ETR(l,,Pl)  to  PI.   The 
1,  3"*"  1 

set  {U.  .  - jk  <  2  <   k+63}  can  be  accessed  using  only  CI.   After  being 

accessed  the  four  neighbors  of  U.  .  must  be  routed  to  the  PE  containing 

1,3 

U.  .  before  U.  .  can  be  improved.   The  following  GLYPNTR  code  uses  this 
1,3        i>3 

method  of  data  access  to  perform  Phase  1,  Step  k   of  column  relaxation: 
LOOP  C:=l,l,N-2  DO  BEGIN  CC:=  C.  [l6:^2];  CI:=CI+K; 
U[PI+CI]:  =  (AI*0  [PI+CI]-BETA*(RTL(1,,U[PI+RPI+CI])+RTR(1,  ,u[ci])) 
+CA*U[PI+Cl])-RTR(l,,U[LPI+CI-K]))*GRABONE(B[C-l],CC);  (37)" 

LPI :  =PI ;  PI :  =RPI ; RPI :  =RTR  ( 1, ,  RPI ) ; END; 

LOOP  IC:=0,l,N-3  DO  BEGIN  C:=N-2-IC; 
CC:+C.[l6:U2];  CI:=CI-K; 

U[PI+CI]:=U[PI+CI]-GRAB0NE(B[C-1],CC)*RTL(1,,U[RPI+CI+K]);  (38) 

RPI:=PI;   PI:=RTL(1, ,PI);END; 


*  The  variable,  k  equals  the  first  column  of  the  group  being  improved. 

2  2 

**  AI=-h   ;   BETA=  l/h     ;W  equals  the  relaxation  parameter  for 

this  iteration;  CA=2*BETA+W. 
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The  code  is  divided  into  two  sections:   Section  (37)  does  Thomas' 
forward  elimination.   Section  (38)  does  Thomas'  backward  elimination; 
Table  7  suggests  how  this  code  could  be  translated  into  ASK  and  how 
much  time  the  ASK  code  would  take. 

By  replacing  PI  with  KPEN=ETR(l, ,PENK)  where  PENK=(0,K,2k, . . . ,63K) 

and  K  is  the  number  of  lines  of  PEM  required  to  store  a  row  of  the 
mesh,  we  would  have  Phase  1  row  relaxation.* 

A  complete  program  for  a  m  by  n  mesh  where  n,  m  <  64  is  found  in 
Appendix  E. 


*  In  row  relaxation  a  PE  index  must  be  used  to  access  {U.  .  ,  k  <  j  <  k+63} 
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2.6  Summary  of  the  Results  for  Iterative  Methods 

Now  that  we  have  examined  implementation  of  the  various  methods  we 
can  make  some  recommendations  concerning  the  use  of  the  methods.   Table  6 
contains  a  numerical  study  of  convergence  rates  of  the  method  applied  to 


Poisson's  equation  where  h  =  h  =  

x    y     15 

error  bound  was  0.0001. 


on  different  mesh  sizes.   The 
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a)  co  is  the  relaxation  parameter 

b)  I  is  the  theoretical  number  of  iterations 

c)  I  is  the  actual  number  of  iterations* 

a 

d)  2ri  is  the  number  of  relaxation  parameters  used  in  ADI 
Table  6.   A  Numerical  Study  of  the  Methods  under  Discussion 

We  can  see  from  Table  6  that  ADI  is  consistently  the  fastest**  then 
comes  SL0R,  with  S0R  being  slowest.   This  conforms  with  theoretical  results. 

In  fact,  in  a  square  mesh  where  h  =  h  ,  SL0R  converges  /2  times  faster  than 

x    y 

S0R  while  ADI  converges  B  times  faster  than  S0R.   B  is  a  monotonically 
increasing  function  of  the  number  of  mesh  points  [Todd,  Pages  396-398]. 


#* 


Remember  ADI  improves  the  points  twice  in  each  iteration. 

The  speed  of  the  algorithms  is  compared  with  respect  to  the  number 
of  iterations  . 


ko 


To  compare  the  time  required  to  perform  one  of  the  above  methods  on 
ILLIAC  IV  to  that  required  to  perform  the  method  on  a  specific  machine -we 
need  to  calculate  the  clocks  per  point  per  iteration  and  compare  it  with  the 
clocks  per  point  per  iteration  on  ILLIAC  IV.  We  have  shown  that  SOR,  SLOR, 
and  ADI  can  be  programmed  to  enable  ILLIAC  IV  to  improve  64  points  simulta- 
neously  for  certain  mesh  sizes.   By  dividing  the  clocks  required  to  improve 
64  points  simultaneously  by  61+  we  obtain  the  clocks  per  point  per  iteration 
on  ILLIAC  IV.   Table  7  contains  these  values  and  other  pertinent  information. 


ADI 

SLOR 

S0R 

Number  of  operations  per 
point  per  iteration 

Mult. 

10 

k 

k 

Add. 

10 

6 

5 

Time  in  clocks  per  6k 
points  per  iteration 

Arithmetic 

160 

78 

71 

Data  Access 

164 

72 

22 

Total  clocks  per  point  per  iteration 

5.o6 

2.34 

1.45 

Optimum  mesh  size** 

641  by  64L 

1281  by  L 

641  by  2L 

Rows  of  PEM  required  for  temporary 

Max 
(n-2,m-2) 

m-2  +,  fo-2) 

e 

storage 

h     6k 

Table  7.   Summary  of  Time  and  Storage  Requirements  for  the  Iterative  Methods 
on  a  m  by  n  Mesh. 


Most  of  the  data  access  time  on  ILLIAC  IV  could  be  eliminated  if  it  was  a 
serial  machine.   Let  Machine  A  be  a  serial  machine  with  a  processor  similar  in 
speed  to  a  PE   but  with  the  ability  to  perform  ADI,   SOR,  and  SLOR  with  neg- 
ligible  data  access  time.     From  Table  7  we  can  see  that  Machine  A  would  tab 
71/1.45,  about  48,  times  longer  to  perform  SOR ' than  ILLIAC  IV  would  take. 
Machine  A  would  take  about  32  times  longer  than  ILLIAC  IV  to  perform  ADI  or 
SLOR. 


For  maximum  efficiency  on  ILLIAC  IV,  one  dimension  of  the  mesh  must  be 
a  multiple  of  6k   for  SOR  and  SLOR  while  both  dimensions  must  be 
multiples  of  6k   for  ADI. 


** 


I  and  L  are  positive  integers 

A  single  PE  has  approximately  twice  the  arithmetic  speed  of  a  CDC  6600. 


#### 


The  clocks  per  point  per  iteration  on  Machine  A  equal  the  arithmetic 
time  in  clocks  per  6k   points  per  iteration  on  ILLIAC  IV. 


1+1 


The  times  in  Table  7  are  assuming  the  inner  loops  of  the  programs  are 
written  in  ASK.   If  the  programs  were  written  completely  in  GLYPNIR  the 
times  would  about  double. 

Tables  6  and  7  indicate  that  ADI  is  the  fastest  iterative  algorithm 
on  ILLIAC  IV  for  rectangular  meshes.   The  theory  for  ADI  has  only  been 
developed  for  special  cases  [Young,  Page  555].   If  we  have  a  commutative 
case,  HV  =  VH,  then  ADI  will  be  effective.   The  commutative  space  requires 
a  rectangular  region  and  coefficients  of  the  differential  equation  which 
are  sufficiently  regular  [Young,  Page  538].   In  some  non commutative  cases, 
numerical  experiments  have  shown  ADI  to  converge  rapidly.  This  is  not 
always  true.   In  some  non  commutative  cases  ADI  fails  to  converge  for  cer- 
tain parameters  [Young,  Page  5^6],  More  work  needs  to  be  done  on  non- 
commutative  cases  before  ADI  can  be  used  freely  on  them. 

ADI  performs  row  relaxation  followed  by  column  relaxation.   Thus  to 
efficiently  perform  ADI  on  a  m  by  n  mesh  on  ILLIAC  IV,  both  m  and  n  must 
be  divisible  by  6k.      For  example  if  we  have  a  l6  by  6k   mesh  when  we  per- 
form row  relaxation,  ILLIAC  IV  will  be  working  at  25%   machine  efficiency 
while  for  column  relaxation  ILLIAC  IV  would  be  working  at  100%  machine 
efficiency.    If  one  had  four  16  by  6k   meshes  to  solve,  ILLIAC  IV  could 
solve  the  four  meshes  simultaneously  and  thus  brings  the  machine  effi- 
I  ciency  up  to  100%. 

SLOR  converges  in  fewer  iterations  than  SOR  and  can  be  programmed  to 
take  about  the  same  amount  of  time  per  iteration  as  SOR  on  most  serial 
machines.   On  ILLIAC  IV  SLOR  requires  data  from  one  PE  broadcast  to  all 
the  PEs  while  SOR  does  not.   This  is  the  major  factor  in  causing  SLOR  to 
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take   about   1.5  times  longer  per  iteration  than  SOR,    see  Table   7.      Thus 
SOE  is   a  faster  method  on  ILLIAC   IV  unless   it  takes   at   least   1.5  times 
more   iterations  to   converge.      This    is   rarely  the  case.      SLOR  requires 
considerably  more  temporary  storage  than  SOR.      Finally  SLOR  is   a  block 
iterative  method  and  thus    cannot  be   as   easily  reformulated  as   SOR  for 
efficient   performance  on   ILLIAC   IV. 

As   shown   in   Section  2.3  of  this    study,    SOR  has   a  few  constraints 
on  how  it  must  be   implemented.      This   enables   one  to  program  it    effi- 
ciently on   ILLIAC   IV  for  most  mesh  geometries.      SOR  seems  to  be  the  most 
promising  of  the   iterative  and  direct   methods   examined  by  the   author 
for  non-rectangular  meshes. 

In  this   study,   we   indicated  that   the  order  in  which  the  elements 
are   improved  by  SOR  does  not    affect   the   asymptotic  rate  of  convergence. 
This   can  be  misunderstood.      The  number  of  iterations   required  to  obtain 
a  specified  error  bound  is   dependent   on  the  order  in  which  the   elements 
are   improved.      For  specific   initial   conditions,   one  ordering  might   give 
a  faster  initial   rate  of  convergence  than   another.      In  some   applications 
of  Pois son's   equation  the  actual  number  of  iterations   required  for   con- 
vergence is   substantially  smaller  than  theoretically  expected  because   a 
good  initial   guess    is   supplied.      In  these  cases   the  ordering  has    a  greater 
affect   on  the   computer  time  required  to   get   an  acceptable  solution.      ADI 
and  SLOR  require  a  considerable  amount   of  preconditioning  before  the 
iterative  process   can  begin.      Thus   if  the   initial   guess   is   good  enough, 
SOR  will  take  the  least  amount    of  computer  time  on  any  mesh. 
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This   study  examined  straight   and  odd-even  storage  for  SOR.      The 
use  of  odd-even   storage   improved  the  speed  of  SOR  by  10%.      In  most 
problems,    straight    storage  is   used.      Thus,    for  a  valid  comparison  of 
the  two   storage  schemes,   the  time  to   convert   straight   storage  to  odd- 
even   storage  must  be   considered.      This  takes   about  the  same  amount  of 
time  as   one  SOR  iteration.        Thus  the  program  must   require   at   least 
20  iterations  to  justify   converting  to  odd-even  storage  and  then  back 
to  straight   storage. 
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3.   IMPLEMENTATION  OF  DIRECT  METHODS  ON  ILLIAC  IV 

3.1      Introduction  to  Hockney's   Direct   Method   (FACR ) 

Direct  methods   for  solution  of  a  restricted  class   of  Poisson's 

equation  have  "been  developed  which  are   faster  than  any  iterative  method 

developed  to    date    [Dorr,   Pages   258-259].      To  the  author's  knowledge, 

Hockney's   Fourier  Analysis /Cyclic  Reduction    (FACR)    is  the   fastest 

direct  method  on   serial  machines    [Hockney,    Page  159].      In  this  paper 

we  will  examine  FACR  and  modify  it   so  that    it   can  be  programmed 

efficiently  on   ILLIAC   IV.      FACR  solves  the   "five-point"   difference 

formula  on   a  rectangular  mesh;    namely, 

u     i    +    -2U     .    +  U   ._    .  U     ,    _    -2U     ,    +  U     ,  ._ 

s-l,t  s,t  s+l,t  s,t-I     ■      S,t  S,t+1  „,  /       v 

^2  ,2  "      s,t  Uy; 

h  h  ' 

x  y 

N 
where  the  number  of  points  being  computed  in  one   direction   is   2     -1   for 

N 
Dirichlet's  boundary  conditions,    2     for  periodic  boundary   conditions   and 

N 
2     +1   for  Neumann's  boundary   conditions.      These  three  boundary  conditions 

are  permitted  in  the  x  and  y   direction,    giving  nine  possible   combinations 

of  boundary   conditions. 

A  2     by  2     mesh  produces   a  linear  system  with  2  unknowns.      Let 

M  =  2     and  N  =  2    .      FACR  solves  this   system  using  the  following  five  step 
algorithm: 

1.  Given  y  compute   ¥      for  the  even  columns    and  overwrite  ¥  with  ¥     on 
the   even  columns . 

2.  Using  T     compute  ¥      and  overwrite  ¥     with  ¥    . 

3.  Using  ¥     solve   for  U     and  overwrite  ¥     with  U   . 

i  s  s 

U.   Using  U  compute  U  on  the  even  columns  and  overwrite  U  with  U  on  the 

even  columns. 


h5 


5.  Using  the  values  of  U  on  the  even  columns  solve  (39)  for  the  values 
of  U  on  the  odd  columns  and  overwrite  ¥  with  U  on  the  odd  columns. 

The  superscripts  are  for  Dirichlet's  "boundary  conditions  and  are 
defined  as  follows: 

ur     =  w        _ 

i,J    i,j+l 


M-l 
2   r  ...*    .   Trik 


fi,j  =  B  J,  vl)  Sln  — 


2  V„     ,   ,ik 


Ui,J  =  M   I     Uk,J 
k=l   * J 


Odd/even  reduction  divides  the  problem  into  two  parts;  first  to  solve  a 
linear  system  concerning  even  columns  and  secondly  to  solve  for  the 
values  on  the  odd  columns.   The  even  columns  form  a  linear  system  with  M^ 
unknowns  which  is  solved  by  steps  2,  3,  and  k.      Step  5  involves  the  r- 
linear  systems  with  M  unknowns  that  give  the  values  for  the  odd  columns. 


Fourier  analysis  decouples  the  even  columns  into  M  tridiagonal  linear 

N  s 

systems  with  —  unknowns.   The  unknowns  are  the  Fourier  harmonics,  U  ,  of  U. 

Recursive  cyclic  reduction  solves  the  linear  systems  for  the  Fourier 
harmonics  U  . 


Fourier  synthesis  converts  the  Fourier  harmonics  U  obtained  by  the 
previous  step  to  U  giving  the  values  of  U  on  the  even  columns. 


The  values  of  U  on  the  even  columns  are  used  to  calculate  the  values 

N 
of  U  on  the  odd  columns.   This  involves  solving  —  tridiagonal  linear 

systems  with  M  unknowns  [Hockney,  Pages  1U8-153]. 
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Refer  to  Appendix  B  for  a  more  detailed  explanation.  Note  that  all  five 
ste-DS  can  be  performed  using  one  mesh  containing  the  "boundary  conditions  and 
¥  initially.   This  is  overwritten  by  f  ,  !  ,  U  and  finally  the  solution  U. 

We  will  discuss  the  method  in  three  parts,  steps  1  and  5S  step  3, 
and  steps  2  and  k.      Suggestions  will  he  made  for  the  application  of 
each  part.   Finally  we  will  present  a  program  in  GLYPNIR  similar  to 
FACR  and  make  suggestions  on  how  it  can  he  programmed  for  ASK. 


3.2  Odd/Even  Reduction  and  Odd  Column  Solution 

Odd/even  reduction  is  used  to  cut  down  on  the  number  of  columns 
where  Fourier  analysis  and  synthesis  is  applied.   Consider  the  three 
neighboring  equations: 

#  "t-2  +  BUt-l  +  ^  Ut  =  Vl 

y  y 

\  \ 

$  ut  +  But+1  +  ^  <w  -  Vl 

y  y 

where  ut  is  the  t   column  of  the  mesh  and  t  is  even.   By  multiplying 

2 
the  middle  even  line  equation  by  -Bh  and  adding  we  obtain 


-  ~          2  2                 -1 

2   U,  +   I     2   -  B^  ti        U     +     ? 

l        t-2  Ih  y  /     t        h 

y  \  y  *  I          V 


■  b2  i )  ut +  h2  «t+2  =  vi  -  biv  \ +  \-i  •  \  iki) 


Note  equation    (kl)    contains  only   even   columns   of  U,   that   is   U     „,   U 

t-2        t 

311(1  Ut+2"      Thus   after    ^i)   is    solved  the  algorithm  must    solve   for  the 
values   of  U  on  the  odd  columns. 
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To  apply  odd/ even  reduction  one  needs   an  odd  number  of  mesh  points 
for  Dirichlet   or  Neumann  boundary   conditions   and  an  even  number  of  mesh 
points   for  periodic  boundary  conditions. 

Odd/even  reduction  halves  the  number  of  columns  that   require  Fourier 
analysis  and  synthesis.      The  odd  columns  are   solved  as  tridiagonal 
matrix  problems,    step   5.        Cyclic  reduction  is   used  to   solve  for  the 
odd  columns  of  U.      This  has  the  potential   of  saving  computational  time. 
Hockney  showed  that   one  level   of  odd/even  reduction  reduces  the  total 
number  of  operations  per  point  on   a  serial  machine   from  3^  to  2k  on  a 
128  x  128  mesh   [Hockney,    Page  l6l]. 

Let  k  be  the  number  of  interior  columns   in  the  mesh.      On   ILLIAC  IV 
Fourier  analysis   and  synthesis   can  be  applied  to  6k  of  these   columns 
simultaneously.      Thus  -without   odd/even  reduction   a  Fourier  algorithm 

5T-  I  times.        If     odd/even  reduction   is   used  a  Fourier 
algorithm  must  be  applied  iTna  I    times. 

Odd/ even  reduction   is  of  no  use   if  k  <   6k  because  then   \tt\  =  [t^o'  I  . 
If  k  >  6k  then   yzT~\  >L  9o  1  and  odd/even  reduction  would  reduce  the 
number  of  times  Fourier  analysis   and  synthesis  must  be   applied.      The 
best   case   is  where    j  tj-J  =  2  pr^TT  J     in  which   case  the  number  of  Fourier 
analysis   and  synthesis   is   cut    in  half. 

Straight  storage  is  the  standard  storage  scheme  used  by  most  pro- 
grams. If  odd/even  reduction  is  used,  maximum  machine  efficiency  is 
obtained  when  an  even  column  is   contained  in  each  PE  for  the  odd/even 


* 

fPl  equals  the  integer  L  where  L  -  1  <  P  <  L. 
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Figure    5.      Storage   Schemes  Used  on  Rows  k  and  5   of  a 
96  by  96  Mesh  if  Odd/Even  Reduction  is  Used 
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reduction  step,  the  Fourier  analysis  step  and  the  Fourier  synthesis  step; 
CRED  can  access  a  different  row  in  each  PE;  and  during  the  solution  of 
the  odd  columns  an  odd  column  can  be  accessed  by  each  PE.   The  following 
storage  changes  will  fulfill  all  these  conditions  assuming  each  row  of 
the  mesh  requires  an  even  number  of  rows  of  PEM. 

1.  We  assume  the  main  program  supplies  U  in  straight  storage. 

2.  Before  odd/even  reduction,  the  even  rows  of  PEM  containing 
U  are  routed  one  PU  to  the  right  of  the  odd  rows. 

3.  Before  CRED,  U  is  placed  in  skewed  storage. 

h.      Before  Fourier  synthesis,  U  is  returned  to  the  storage  used 
for  odd/ even  reduction. 

5.   The  mesh  U  is  placed  in  straight  storage  at  the  end  of  the 
subroutine. 

See  Figure  5  for  clarification  of  the  above  storage  schemes.   If  odd/ 
even  reduction  is  not  used  steps  2  and  h   of  the  storage  changes  need  not 
be  executed.   The  use  of  odd/even  reduction  requires  the  extra  storage 
changes  because  steps  one  through  four  of  FACR  are  applied  to  even 
columns  while  step  five  solves  the  odd  columns .   Without  the  use  of 
odd/even  reduction  steps  2,  3,  and  h   in  Section  3.1  would  be  applied  to 
all  the  columns. 

3.3  CRED 

In   step   3,    section   3.1  we  have   a  tridiagonal   system  of  equations  to 
solve.      The   derivation  of  the  coefficients   for  Dirichlet's  boundary 

conditions    is   contained  in  Appendix  B.      By  multiplying  equation    (B-8)    in 

2  th 

Appendix  B  by  h       the   diagonal   coefficient   for  the  k        row  becomes 
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-=-K^-¥-«(^^4-)  -^+6^+8g+2; 


h   '  h      h 

X  XX  XX 


Hockney  solves  these  tridiagonal  systems  using  cyclic  reduction  recursively. 
Given 

lt-U     +\lt-2+lt  =\2^,t-l 

lt-2  +  \  It  +  lt+2  =hy2lt     ^ 

Uk,t  +  **  Uk,t+2  +  K,t+k       ■■  h/  *k,fl 

where  t    is   even. 

By  multiplying  the  middle  equation  by  -A^and  adding  we  obtain 

it-* +  <2  -  #  it +  it+u  ■  \2  (\,t-2  +  \,t+2  -  «;,*' 

(1*3) 

Now  the  process   is   repeated  on  the   equations   until  we   are  left  with  just 
one   equation.      Then  the  process    is   reversed   [Hockney,    Page  150].      This 
method  uses   Log  N  extra  words   of  storage,    UN  -  2LogpN  additions,    2N 
multiplications   and  Log  N  divisions   for  an  N  variable   system.        It    also 
requires  that    N  equal  2     -  1   if  the   system  has   Dirichlet   conditions,    2 
if  the   system  has  periodic   conditions,    and  2     +  1   if  the   system  has 
Neumann  conditions. 

For  Dirichlet  boundary   conditions   Thomas'   method  allows  one  to 
solve  a  tridiagonal   system  of  N  variables  where  there   are  no  restrictions 
on  N,    (See  Section   2.5.1).      In  the   case  where  the   diagonal   elements    are 
equal   and  all  the  other  non-zero   elements    equal   1,  we   calculate 


* 

N  is  the  number  of  interior  even  columns  in  the  mesh. 
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XI  =  IT  '    Sk,i  =    A   -ej    .    *  for   i  =   2,3,...,N.  (i*) 

k  '  k     k,i-l 


"k,i =  *k,i  ek,i  ••  \,± =  (Tk,i-  Vi-i'  ei  for  i  =  2'3----'H-  ^s) 


Uk,H  *  «k.H'    Uk,i=  4k,i"   ek,i   li+l      fOT  *  =  B-1'"-2'  ■-1-  ik6) 


This  method  uses  N  words  of  extra  storage,  N  divisions,  2N  multipli- 
cations, and  3N  additions.   On  ILLIAC  IV  a  division  requires  as  much  time 

as  six  multiplications.  Thus  the  calculations  of  the  e.'s  takes  most  of 

1 

the  time  required  for  Thomas'  method.   Since  the  e.'s  are  independent  of 
U  and  V   they  could  be  calculated  in  preconditioning  and  used  repeatedly 

by  Thomas'  method.   Since  each  row  has  a  different  A,  we  need  a  different 

J  k 

set  of  e   . ' s  for  each  row  of  U.   Thus  all  the  e   . 's  would  require  half 
K.,  i  k ,  x 

as  much  storage  as  U  if  odd/even  reduction  is  applied.   Thomas'  method 
can  be  applied  to  an  arbitrary  M.   This  not  only  allows  the  program  to  be 
more  general  but  in  some  cases  can  save  storage  in  placing  U  and  ¥  in  PEM. 
If  a  problem  with  Dirichlet  or  Neumann  boundary  conditions  required  the 
accuracy  obtained  by  6h   by  6k   meshes,  applying  cyclic  reduction  would 
require  the  meshes  to  be  65   by  65  while  Thomas'  method  would  accept  65 
by  6k   meshes.   In  straight  storage  a  65  by  65  mesh  requires  130  lines  of 

PEM  while  a  65  bv  6k   mesh  reauires  65  lines  of  PPM.   Tf  thp  boundarv 
conditions  are  periodic,  cyclic  reduction  would  require  a  6k   by  6k   mesh, 
and  would  require  only  6k   lines  of  PEM.   The  extra  storage  required  to 

store  the  e   . 's  for  Thomas'  method  is  too  great  for  some  memory  bound  problems 
k,  1 

By  changing  (k6)   to 

Uk,N  =  V  Uk,N-l  =  \,N  '  \  Uk,N; 
s      s  (W 

V  =  V1+1 "  Ak  uk,i+i  -  uk,i+2    for  i  =  N-2>  N-3>  •••> 1- 
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we  eliminate  the  need  for  extra  storage.  We  will  refer  to  this  method 
as  modified  Thomas'  method.  It  takes  N  divisions,  2N  multiplications , 
and  UN  additions. 


Table   8  summarizes  the  storage   and  computational  time  of  cyclic 
reduction,    Thomas'   method  and  modified  Thomas  method  where  the   divisions 
are  done   in  preconditioning  for  Thomas'   method  and  cyclic  reduction   and 
the  divisions   are  not   done  in  preconditioning  for  the  modified  Thomas'  method 
The   computational  time   is   calculated  by  multiplying  Ts    9   and  56   clocks  by 
the  number  of  additions,   multiplications   and  divisions   respectively  and 
is   for  6k  rows  of  U.      The  storage  is   in  words   not   lines  of  PEM. 


Mesh  Size 
Restrictions 

Extra  Storage 
Requirements 

Time  per  6k 
Rows   in  Clocks 

Thomas'    Method 
with  odd/ even 
reduction 

An  odd  number 
of  columns 

Thomas  ' 
Method 

NM/2  words 

39N/2 

Mod.    Thomas' 
Method 

No  words 

102N/2 

Thomas'    Method 
without   odd/even 
reduction 

No 
Restrictions 

Thomas  ' 
Method 

NM  words 

39N 

Mod.    Thomas  ' 
Method 

No  words 

102N 

Cyclic  Reduction 

N  =  21  +  1 

MLog     N  words 

1+6N  -  Iklogjt 

Table   8.      Comparison  of  Three  Methods  to  Solve   Tridiagonal  Matrix 

Equations   for  Dirichlet's   or  Neumann's  boundary   conditions 


Odd/even  reduction   is  most  helpful  when  f^]  =  2   fj|g]     where  k  is   the 
number  of  interior  columns    (see  Section  2).      The  time  estimates  of  Table  8 
for  Thomas'    method  with  odd/ even  reduction  are  only  half  the  corresponding 
values    for  Thomas'   method  without   odd/even  reduction.      This   is  because 
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odd/ even  reduction  decreases  the  size  of  the  tridiagonal   system  to  he 
solved  hy  a  factor  of  2. 

For  cyclic  reduction  the  amount   of  extra  storage   in  Tahle  8  is 
misleading.      If  I   >  6  then  6Um  words  of  extra  storage  are  required  per 
mesh  in  PEM.      If  both  U  and  ¥  are  in  PEM,then  128M  words   are  required. 
This   is  because  cyclic  reduction  requires   2     +1   columns   for  Dirichlet 's 
or  Neumann's  boundary  conditions  while   a  similar  error  bound  could  be 
obtained  using  Thomas'   method  and  2     points. 

For  cyclic  reduction  on  Neumann's  boundary   conditions,    if  I   >  6 
then  an  extra  Fourier  analysis   and  synthesis  must  be  performed  to  solve 
the  last   column.      This  time   should  be  divided  by  the  number  of  rows   and 
added  to  the  time  per  row  for  cyclic  reduction   on  Neumann's  boundary 
conditions. 

If  odd/ even  reduction  is   used  then  the  odd  columns   of  U  are   calculated 
by  solving  tridiagonal  systems.      These   systems   are  already  restricted  in 
size  by  the  FFT  performed  on  the  even  columns.      One  cyclic   reduction  pro- 
gram can  be  written  to  solve   for  U  on  the  odd  columns   for  all  three 
boundary   conditions.      Thus   cyclic   reduction  will  be  used  to   solve   for  U 
on  the  odd  columns  when  odd/even  reduction   is   employed. 

3.^     Fourier  Analysis   and  Synthesis 

Hockney  wrote  one  routine  which  performs   Fourier  analysis  or  synthesis 
for  any  of  the  boundary  conditions    [Hockney,   Page  201].      His   routine   assumes 
that  there  is   plenty  of  temporary  storage  available.      Although  this   is  true 
on  most   large   serial  machines   ILLIAC   IV  has   a  relatively  small   amount   of 
storage  compared  with   its   speed.      Thus   a  method  which   is   as   fast   as 
Hockney' s  Fourier  routine  but   does  not  require  extra  storage  would  be 
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preferred.      Cooley,   the  author  of  FFT,  presents  such  a  method  in 
[Cooley,   Page   320-323].      Cooley  shows  that   it   is  possible  to  use  the 
complex  fast  Fourier  transform  to  obtain  the   sine  series    (required  for 
Dirichlet  boundary  conditions),      cosine  series    (required  for  Neumann 
boundary   conditions)    and  real   series    (required  for  periodic  boundary 
conditions).      Both  Hockney's   and  Cooley' s  methods  place  restrictions 
on  the   interior  mesh  sizes:      21  -1   for  Dirichlet,    2     for  periodic   and 
2+1   for  Neumann. 


Cooley' s   algorithm  for  Dirichlet  boundary  conditions   goes   as 
follows 


.      let  M  =  21,   Y(j)    for  j    =1,    ...,   M-l  be  the   coefficients   of 


M-l  .^ 

the  Fourier  sine  series,   b(k)  =     £     Y(j)   sin     „     . 

J=l 

Y(0)   =  Y(M)   =  0,    and  Y(j)   =  -  Y(2M-j)   =  -  Y(-j)    for  j   =  0,    . . . ,   M. 

Define 

X(j)   =  -[Y(2j+1)   -  Y(2j-1)]   +  Y(2j)i   for  J   =  0,    ...,   M/2. 

Thus 

X(0)    =  ~[Y(1)    +  Y(-l)]   +  Y(0)   =   -2Y(l)   +   Oi 

and 

X(M/2)   =  ~[Y(M+l)    -  Y(M-l)]    +  Y(M)i   =   2Y(M-l)   +   Oi. 
For  j   =1,    ...,   M/2-1, 

X(M/2+j)    =  -   Y(M+2j+l)    +  Y(M+2j-l)   +  Y(M+2j )i   =  Y(M-2j-l)    -  Y(M-2j+l) 
-  Y(M-2j)i. 

Let  X(j  )   =  C(j)   and  calculate  A^j)    and  Ag(j  )   for  J   =  0,   1,    . . . ,   M/k 
using 

VJ)   =  °(J)   +  C(M/2+j)    andA2(j)   =    [c(j)    -  C(M/2+j)]W^ 

where 

WM  =  C0S  ¥"    +  ±  SIN  ¥"  • 


Then 


calculate  A(j)    and  A(M/U+j  )    for  j    =   0,   1,    . . . ,   M/U-l 
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vhere 


A(j)   =  Ax(j)    +   i  A2(j),   A(MA+j)    =  A±    (j)    -  iA2(j). 
Thus, for  J   =  0,   1,    ...,   M/U, 

A(j)    =  -Y(2j+1)    +  Y(2j-1)    +  Y(M-2j-l)    -  Y(M-2j+l)    +  SIN  ^_  [Y(2j+l) 
-   Y(2j-1)    +  Y(M-2j-l)    -  Y(M-2j+l)]   +   COS  ^L  [Y(2J  )    +  Y(M-2j)] 
+   i{Y(M-2j)    -  Y(2j)    +  SIN  ^li(Y(2j)    +  Y(M-2j ) ]   +  (1+8.1) 


COS 


—A-      [Y(2j-1)    -   Y(2j+1)   -   Y(M-2j-l)    +  Y(M-2j+l)]} 


and  for  J    +1,    . . . ,    M/k 

A(M/2-j)    =   Y(2j-1)    -  Y(2j+1)    -  Y(M-2j+l)    +  Y(M-2j-l)    - 

SIN  2gl   [Y(2j+1)    -  Y(2j-1)    +  Y(M-2j-l)    -  Y(M-2J+l)]    - 
COS  ^-  [Y(2j)    +  Y(M-2j)]    -   i  {Y(M-2j)    -  Y(2j)    - 


SIN  ^L  [Y(2j)    +  Y(M-2j)]   -   COS  ^f-  [Y(2j-l)    -  (U8.2) 


^-  [Y(2j)    +  Y(M-2j)]   -   COS  & 
Y(2j+1)    -  Y(M-2j-l)    +   Y(M-2j+l)]}. 


Now  the   complex  fast   Fourier  transform  is   applied  to  A  giving 


N/2-1 


Tjk 


X(j)   =        I       A(k)w;L    for  j   =  0,    1,    ...,    M/2-1. 
k=0 

For  j  '  1,   3,   5,    ...,   M-3  let 

b(j)   =  X   (p~)    imaginary  and 

b(j+l)   =  X   (^)    real;   b(m-l)   =  X(|-l)    imaginary. 

Finally  b(j)    for  j   -  1,    ...,    M-l   is   calculated 

Ub(j)   =    [b(j)    -  b(M-j)]    -    [b(j)+b(M-j)]/[2Sin  *£],  (U9) 

M-l 
Now  we  have   Ub(j)   =   k      £     Y(k)   Sin  ™-     . 

k=l 

Fourier  analysis  calculates  —  b(j)  and  Fourier  synthesis  calculates  b(j). 
Since  we  are  dealing  with  linear  equations  we  can  multiply  p  in  (l)  and 
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0 

1 

$B:=P[$X](2) 

PE  Fetch  overlapped 

0 

1 

$A:=$Sj$S:=$B 

Two  PE  Register  Transfers 

2 

1 

P[$X](2):=$R+$A 

Addition,  PE  Store 

Ik 

1 
j 

1 

$C2 : =$C2+$D2 

CU  Addition  overlapped 

0 

,M-lv  TNI 

1     1 

$R:=P[$X](2) 

PE  Fetch  overlapped 

0 

3°^  y 

P[$X](2):=$H+$P 

Addition,   PE  Store 

Ik 

$C2:=$DO+$D5-$D2 

CU  Addition  overlapped 

0 

$S:=P[$X-(2) 

PE  Fetch  overlapped 

0 

$C3:=$D5 

CU  Register  Transfer  overlapped 

0 

P[$X](3):,-$S-$S 

Addition,  PE  Store 

14 

$C3 : =$C3+$D2 

CU  Addition  overlapped 

0 

$R:=P[$X](3)            ^ODD3 

PE  Fetch  overlapped 

0 

P[$X](3):=-$R-$R 

Addition,  PE  Store 

li+ 

.    TNT 

$CO:=$D3 

CU  register  transfer  overlapped 

0 

teU 

2 

$C3 : =$C3+$D3  J  $C2 : =$C3 -$D2 

CU  Addition  overlapped 

0 

2 

$R:=P[$X](3)-$R 

PE  Fetch  overlapped,  Addition 

7 

2 

$D6: =GRABONE (DS [GL] ,N2-IL) 

CU  Load 

10 

2 

$D7 :  =GRABONE  (DS  [GL] , IL) 

CU  Load 

10 

2 

$S : =$D6*$R 

Multiplication 

9 

2 

$A:=P[$X](2)*$D7 

PE  Fetch  overlapped,  Multiplication 

2 

$S :  =$A-$S           %0DD2 

Addition 

7 

2 

$R:=$R-4D7 

Multiplication 

9 

2 

$A:=P[$X](2)*$D6 

PE  Fetch  overlapped 

9 

2 

$R:=$A+$R           %ODDl 

Addition 

7 

2 

$C3:=$DO+$DU-$CO 

CU  Addition  overlapped 

0 

2 

$CO:=$CO+$D3;$Cl:,$C3-$D3 

CU  Addition  overlapped 

0 

2 

P[$X](2):=P[$X](3)-P[$X](1) 

Two  PE  Fetches,   One  overlapped 
Addition,  PE  Store 

21 

2 

$C2 : =$C2+$D2 ; $C1 : =$C1+$D2 

CU  Addition  overlapped 

0 

2 

ODD3:=P[$X](2) 

PE  Fetch  overlapped,   PE  Store 

7 

2 

P[$X](2):4S-P[$X](1) 

PE  Fetch,  Addition,   PE  Store 

21 

2 

P[$X](1):4S+P[$X](1) 

PE  Fetch  overlapped,  Addition,   PE 
Store 

11+ 

2 

$C2:=$C2-$D2 

CU  Addition  overlapped 

0 

2 

$S:=P[$X](2) 

PE  fetch  overlapped 

0 

2 

P[$X](1):4S-$R 

Addition,   PE  Store 

Ik 

2 

$B:=$R 

PE  Register  Transfer  overlapped 

0 

2 

$R:=ODD3 

PE  Fetch  overlapped 

0 

«^>I3 

2 

P[$X](2):=$S+$B 

Addition,  PE  Store 

Ik 

$C2 :  =$DU+$D1 ;  $C3 :  =4Di++$D5 

CU  Addition  overlapped 

0 

$R:=P[$X](2) 

PE  Fetch  overlapped 

0 

P[$X](2):=P[$X](3) 

PE  Fetch,   PE  Store 

Ik 

*® 

P[$X](3):=$R+$R 

Addition,   PE  Store 

Ik 

SUBTOTAL 

(57M-155)f^| 

Table  9.      Preparation  of  the  Data  for  the  Fast  Fourier  Transform* 


*The  logic  for  the  code  to  keep  current  values  in  the  ADB's  defined  in 
the  initial  conditions  is  supplied  in  GLYPNIR,  Appendix  B. 
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INITIAL  CONDITIONS 


REGISTER 


CONTENTS 


REGISTER 
$D2 


CONTENTS 


REGISTER 

Id! 


CONTENTS 


REGISTER 
|D8 


CONTENTS 


IF 

$DO 
$D1 


PI 
J 

L 


$D3 

$DU 


K 
KP 

KM 


$D6 

$D7 


ILJ 

IP 

Jl 


$D9 
$D10 


12 

CJ 
IT 


!  = 

'ASSIGNMENT  STATEMENTS 


'OPERATIONS  AND  OVERLAP 


TIME  IN 
CLOCKS 


TOTAL  TIME 
FOR  LOOP 


$C2;=$D5 


CU  Register  transfer  overlapped 


$C3:=$C2+$D1 
$S:=P[$X](2) 
P[$X](2):=$S+P[$X](3) 
P[$X](3):=$S-P[$X](3) 

$C2:=$C2+$D2 


!CU  Addition  overlapped 

'PE  Fetch  overlapped 

IPE  Fetch,  Addition,  PE  Store 

PE  Fetch  overlapped,  Addition,  PE 

Store 
CU  Addition  overlapped 


0 

0 

21 

lit 

0 


35 


,M+l,rN-| 


$C0:=$D3  ;CU  Register  Transfer  overlapped 

$C2;=GRABONE( INDEX [GL],IL);  JCU  Load 

4 


o 

10 


10(^, 


$R:=P(0) 

P(0):=P(2) 

P(2):=$R 

$C0 ; =$C0+1 ; $C2 ; =$C2+1 


jPE  Fetch 

jPE  Fetch,  PE  Store 

iPE  Store 

JCU  Addition  overlapped 


7 
lit 

7 
0 


28 


<¥ffl 


$D11 : =GRABONE ( DS [ GL ] , N2-N11 ] 

*$D8  JCU  Load;  Multiplication 

$D12:=GRAB0NE(DS[GL],N11)   CU  Load 


1 


19 
10 


29(M-1) 


$C2:=$Dl++$D7+$D5 

$C3 : =$C2+$D9+$D9 ; $C1 : = 

$C3+$C2 
$R:=P[$X](3)*$D11 
$S:=P[$X](1) 
$B:=$S#$D12 
$R:=$R-$B     #0DD1 
$S:=$S#$D11 
$A:=P[$X](3)*$D12 
$S :  =$A+$S     #0DD2 
P[$X](3):=P[$X](2)-$R 

P[$X](2):=P[$X](2)+$R 

$C2:=$C2+$D2 
P[$X](1):=P[$X](2)-$S 

P[$X](2):=P[$X](2)+$S 


;CU  Addition  overlapped 

CU  Addition  overlapped 

PE  Fetch  overlapped,  Multiplication 

PE  Fetch  overlapped 

Multiplication 

Addition 

Multiplication 

PE  Fetch  overlapped,  Multiplication 

Addition 

PE  Fetch  overlapped,  Addition,  PE 

Store 
PE  Fetch  overlapped,  Addition,  PE 

Store 
CU  Addition  overlapped 
PE  Fetch  overlapped,  Addition,  PE 

PE  Fetch  overlapped,  Addition,  PE 
Store 


0 
9 
0 
9 
7 
9 
9 
7 

lit 

lit 
0 

Ik 
lit 


106(Log2(M)-2) 


Wl 


53 


SUBTOTAL       (^(M+l)Log2(M)+5M-79) 


N 

55 


3^-5^ 


Table  10.   The  Complex  Fast  Fourier  Transform* 


The  logic  for  the  code  to  keep  current  values  in  the  ADB's  defined  in  the 
initial  conditions  is  supplied  in  GLYPNIR,  Appendix  B. 
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INITIAL  CONDITIONS 

REGISTER    CONTENTS    REGISTER 

CONTENTS    REGISTER 

CONTENTS 

REGISTER 

CONTENTS 

$X          PI         $DO 

J          $D1 

I 

$D2 

ILJ 

LOOP   ASSIGNMENT  STATEMENT 

OPERATIONS  AND  OVERLAP 

TIME  IN 
CLOCKS 

TOTAL  TIME 
FOR  LOOP 

$D3 : =GRABONE ( IS [ GL ] , 11 ) 

CU  Load 

10 

10  <^) 

6 

$C1:=$D2+$D1;$C2:= 

$D0-$D1+$D2 

CU  Addition  overlapped 

0 

6 

$R:=P[$X](1)  . 

PE  Fetch  overlapped 

0 

6 

$B:=P[$X](2) 

PE  Fetch 

7 

6 

$R:=$R=$B 

Addition 

7 

6 

$A:=P[$X](1) 

PE  Fetch  overlapped 

0 

6 

$C3:=$D3 

CU  Register  transfer  overlapped 

0 

6 

$A:=$A+$B 

Addition 

7 

6 

$S:=$A*$C3 

Multiplication 

9 

6 

P[$X](1):=$R+$S 

Addition,  PE  Store 

ll+ 

6 

P[$X](2):=$S-$R 

Addition,  PE  Store 

Ik 

58( 


M+l 


■>& 


SUBTOTAL  29Mflrl+29lt|in+   5Mf5 


Table   11.  Final  step  in  Cooley's  Fourier  Analysis  and  Synthesis* 


*The   logic   for  the  code   to  keep  current  values  in  the  ADB's   defined  in 
the  initial  conditions  is   supplied  in  GLYPNIR,   Appendix  B. 
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INITIAL  CONDITIONS 

]GISTER 

CONTENTS 

REGISTER 

CONTENTS 

REGISTER 

CONTENTS 

REGISTER 

CONTENTS 

$X 
$D0 

PI 
CI 

$D1 
$D2 

IL 
L2 

$D3 
$Dl+ 

L3 
K 

$D5 
$D6 

KM1 

)0P   ASSIGNMENT  STATEMENTS       IOPERATIONS  AND  OVERLAP               '  ™®  _JN   *  ^^r  oSf 

CLOCKS      FOR  LOOP 

7    $C2:=$D1;$C3:=$D0          CU  Register  Transfer  overlapped 
7    P(2):=RTR($C3,,P(2));       PE  Fetch,  Route,  PE  Store 

0 
20 

20fe-l)|^ 

$C1:=0;$C3:+$DO            CU  Register  clear  overlapped 

$S:=1.0/A(3)               PE  Fetch,  division 

$B:=P[$X]*$S               PE  Fetch  overlapped,  Multiplication 

0 

63 

9 

^r*i 

$C1:=$C1+1 

$A:=RTR($C1,,A(3)) 

$S:=RTR(l,,$S) 

$X;=RTR(l,,$X) 

$R:=RTR(1,,$B) 

$A:=$A-$S 

$S:=1.0/$A      %E 

$A:=P[$X]-$R 

$B;=$A*$S       %Q, 


CU  Addition  overlapped 

PE  Fetch,  overlapped,  Route 

Route 

Route 

Route 

Addition 

Division 

PE  Fetch  overlapped,  Addition 

Multiplication 


6 
3 
3 
3 
7 

56 
7 
9 


9Mn-i 


$A:=RTR($C1,,A(3)) 

$R:+$B 

$A:=$A*$B 

$C2:=$D6 

$S;=P[$X](2)-$A 

P[$X](2):=$R 


PE  Fetch  overlapped,  Route 

PE  Register  transfer  overlapped 

Multiplication 

CU  Register  transfer  overlapped 

PE  Fetch  overlapped,  Addition 

PE  Store  overlapped 


22 


&1 


$C2 : =$D2+$D3 ; $C1 : =$C1-1 


$B 
$S 
$X 


=RTL(l,,$R) 
=RTL(1,,$S) 
=RTL(1,,$X) 


%E 


$A;=RTR($C1,,A(3)) 

$R : =$B     #Q 
$A:=$A*$S 
$A;=P[$X](2)-$A 
$A;=$A=$R     %T 
P[$X](2):=$S 
$B:=$S     %Q 
$S:=$A     %E 


'CU  Addition  overlapped 

Route 

Route 

Route 

PE  Fetch,  partially  overlapped 

Route 

PE  Register  transfer  overlapped 

Multiplication 

PE  Fetch  overlapped,  Addition 

Addition 

PE  Store  overlapped 

PE  Register  transfer 

PE  Register  transfer 


0 
3 
3 
3 

10 
0 
9 
7 
7 
0 
1 
1 


U6(M-i) 


*C2:=$D1;$C3:=$D0 
P(2):=RTL($C3,,P(2)) 


CU  Register  transfer  overlapped 
PE  Fetch,  Route,  PE  Store 


0 
20 


20 


(m-dH 


SUBTOTAL 


i^^l-^farl 


+ko 


(M-l)f- 


Table  12.   CRED* 


*The  logic  for  the  code  to  keep  current  values  in  the  ADB's  defined  in 
the  initial  conditions  is  supplied  in  GLYPNIR,  Appendix  B. 
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the  boundary   conditions  by  -^  .      Then  apply   (1+8.1),    (1+8.2),   the  complex 
fast   Fourier  transform  and   (1+9)  before. and  after  CRED.      This   algorithm, 
MFACR,    is   a  modified  FACR  which  solves    (39)    for  Dirichlet's  boundary 
conditions  without   using  odd/even  reduction. 

3.5      Implementation  of  MFACR 

We  will   discuss  MFACR  in  explaining  how  it   can  be  programmed  in 
ASK  and  how  much  time   such  a  routine  would  take.      Appendix  F  contains 
GLYPNIR   code  of  the  algorithm.     Appendix  A  contains   information  on 
timing  methodology  and  explains  the  notation  used  in  Tables   9  through 
12.      Both  ¥  and  U  are  M+2  by  N+2  mesh.        ¥  and  U  are  stored  in  straight 
storage   in  C   and  P,    respectively.      MFACR   consists   of  five   steps: 

1.  Set   up  storage  area  P 

2.  Fourier  analysis  uses  f/  to   calculate  f/ 

3.  CRED  uses   yS   to  calculate  US 

h.      Fourier  synthesis   uses  U     to   calculate  U 

5.      Restore  boundary   conditions   and  C. 
Step  one  places   a  linear  combination  of  the  boundary  conditions   and  f/ 
in  the  interior  of  the   storage  area  P.      In  the  process  the  first  row 
of  U  along  with  the   second  and  N+lst   row  of  ¥  are  lost.      Thus  they  are 
saved  in  temporary  storage.      This   section  of  the  algorithm  takes 
(23M+102)  T^irl    +  !3M  clocks    in  ASK.      Section  1  of  Appendix  F  contains 
the  GLYPNIR   code. 

Fourier  synthesis   and  analysis   are  performed  on  up  to  6k  columns 
simultaneously  described  above.      They  are  presented  in  three  parts. 
First   equations    (1+8.1)   and   (1+8.2)  prepare  the  data  for  the  complex 
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* 

fast   Fourier  transform.      Table  9      contains   suggestions   on  how  these  equa- 


as 


tions  could  be  programmed  in  ASK  and  -supplies  (57M-155)  (FT  clocks 
a  time  estimate  for  this  step.  Table  10  contains  suggestions  on  how  the 
complex  fast  Fourier  transform  could  be  programmed  in  ASK.  A  time  esti- 
mate of  \tt  ^Mfl)  Log  (M+l)  +  5  M-79)  Ig^l  +  3^M  -  5^  is  given  for  this 
step.      Finally  the   implementation  of  equation    (1+9)    in  ASK  is    discussed 

in   Table  11  where  a  time   estimate  of  29 (M+l)     tt-       +   5M  +  5   is  presented. 

53 
The  total  time  estimate   for  Cooley's   algorithm  is    {-^-r-  (M+l)   Log   (M+l ) 

+  91M  -  205}    frr-l   +  39M  -  1+9   clocks.      The   GLYPNIR  code   for  Cooley's 

method  is   found  in  section  2  of  Appendix  F. 

MFACR  uses  the  modified  Thomas'   method,    equations    (1+1+) ,    (1+5),    and 
(1+7)   to  perform  CRED  on  up  to   6k   rows   simultaneously.      Table  12  suggests 
how  this  method   could  be  programmed  in  ASK  and  estimates  how  long  that 
code  would  take,    Uo(M-l)    W-     +    (ll+0N-l+6)     gr-|      clocks.      Section   3  of 
Appendix  F  contains  the  GLYPNIR  code  for  CRED. 

The   final   step  of  MFACR   consists   of  restoring  the  boundary   condi- 
tions  and  the  portions   *+/  which  have  been  altered.      Section  k  of  Appendix 
F  contains   the  GLYPNIR   code  for  this   step.      If  programmed  in  ASK  this 
step  would  take   approximately  1+2    \zr\   clocks. 

The   complete  MFACR  requires    (53/2(M-l)   Log   (M+l)   +  245M-306 }    \jt-\ 
+    (ll+0N-l+6)    YqA  +  91M   -  98    clocks.      The  timing  algorithm  is   taken  from 
Appendix  A. 


The  loop  numbers   in  Tables  9  to  12   correspond  to  the  loop  number  in 
the  GLYPNIR   code   in  Appendix  F.      They  are  supplied  to   enable  the 
reader  to   compare  loops    in  the  GLYPNIR  code  with  the   corresponding 
loops    in  the  ASK  code. 


62 


3.6     Implementation  of  FACR 

FACR,   one  level   of  odd/ even  reduction  would  require    (53/2(M-l)  log   (M+l) 
+  U05M  -  k66)      kf^j   +    (80M  +  10U)    f^fl   +    (TON  -  23)    Qf|   +  8lM 
-   Qk   clocks.      This    figure  assumes  the  odd  columns   are  solved  by  Thomas' 
method  with  the   e's   calculated  and  stored  in  preconditioning.      Thus 
the  odd  columns  would  take    (8UM  -  k6)     fTpo"     clocks.      Setting  up  the 
even   columns  would  take   79M    k^To"     •      Preparing  the  odd  rows   for  solu- 
tion would  take   55M   \Tpdi    clocks.      CRED  would  take    (70N-23)    \-rrl 


clocks.      Fourier  analysis   and  synthesis  would  take    (53/2(M-l)   log      (M+l) 
+  192M  -  420)  Ji2R       +  ^^M  "  ^      The  extra  storage  manipulation 

rNi 
would  take  1JM    ytt]  clocks.      The  rest  of  the  algorithm  is   identical  to 

MFACR. 

3. 7     Summary  of  the  Results  on  Direct   Methods 

In  programming  an  algorithm  on  ILLIAC   IV,    an  attempt    is  made  to 
adapt  the  restrictions   of  the   algorithm  to  maximize  storage  and  machine 
efficiency.      For  efficient   storage  on  ILLIAC   IV,    rows   stored  across   PEM 
should  contain  a  multiple  of  6k  words.      To  maximize  machine   efficiency 
the  number  of  values  being  computed  simultaneously  equals   6h.      Hockney's 
method  performs   a  Fourier  analysis   and  synthesis  on  each  column  and 
performs   CRED  on  each  row.      Thus  to  maximize  machine  efficiency  on 
ILLIAC   IV  the  number  of  interior  rows   and  the   number  of  interior 
columns  must  be  multiples  of  6k.      Fourier  analysis    and  synthesis 
using  Cooley's  method  requires   2     -  1  points  per  column   for  Dirichlet's 
and  Neumann's  boundary  conditions.      If  I   >   6  then  storing  the  columns 
across   PEM  would  waste  63  words  per  column.      Thus  we  will   store  each 
column   in  one  PE.      Now  we  have   columns   restricted  to  2  +1    (2    )    for 
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Dirichlet's  or  Neumann's    (periodic)  boundary   conditions.      Rows   are 
restricted  to  multiples  of  6h  to  maximize  storage  efficiency.      For 
periodic  "boundary   conditions    cyclic  reduction  saves   time  and  temporary 
storage  over  Thomas'   method  [Hockney,   Page  150],    and  restricts   the 
mesh  to   2      columns.      This   restriction  on  mesh  size  still   allows  maxi- 
mum storage  and  machine   efficiency.      Cyclic  reduction  restricts   the  mesh 
to  2     +1   for  Dirichlet's   and  Neumann's  boundary  conditions        This 
restriction  reduces   storage  efficiency  for  both  boundary  conditions   and 
it   reduces  machine   efficiency  for  Neumann's  boundary  conditions. 
Thomas'   method  and  the  modified  Thomas'   method  have  no  restrictions  on 
the  number  of  columns.      Thomas'   method  is   faster  than  the  modified 
Thomas'   method  but   requires  temporary  storage.      Refer  to  Table   8  for  a 
more   detailed  comparison  of  the  methods   available   for  use  by   CRED. 

Hockney's   FACR  program  handles  the  9   different   combinations  of 
boundary  conditions   and  performs   one  level  of  odd/even  reduction  in   all 
cases.      To  maximize   efficiency  of  such  a  program  on   ILLIAC   IV  different 
algorithms  would  be  required  for  different  mesh   sizes   and   combinations 
of  boundary   conditions. 

As   stated  earlier,    odd/even  reduction  saves   computer  time  on 
ILLIAC   IV  for  certain  mesh  sizes  but   not   for  others.      Thus   FACR  on 
ILLIAC   IV  should  have  two   algorithms,   one  which  uses  odd/even  reduction 
and  one  that  does   not.    Furthermore,   the   choice  of  a  method  to   solve 
CRED  depends  on  the  storage  requirements  of  the   individual  problem. 
It    is  the  author's   opinion  that  modified  Thomas'   Method  is  the  best 
compromise  between  storage   and  speed  for  Neumann's    and  Dirichlet's 
boundary   conditions,    and  cyclic  reduction   is  the  best   for  periodic 
boundary   conditions. 
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If  Fourier  analysis   and  synthesis    is   applied  to   columns  which  have 
Neumann's  boundary   conditions   then  there  must  he  2     +1   interior  rows 


to  be   evaluated  by  CRED.      This  means    CRED  must  be  executed    < ■■;--<- 
J  .-1  „I  iul 


I  6k   I 
(21+vl       21 

times.      Note    |    „     |     _7T+1    if  I   >   6   and  thus    for  one   execution  of 
CRED,    ILLIAC   IV  is  working  at   l/6h        of  its   capacity.      The  user  can 
avoid  this  machine  inefficiency  by  setting  up  the  program  so  that 
Fourier  analysis   and  synthesis   is   performed  in  the   direction  which  has 
non-Neumann  boundary   conditions. 

On  a  M  by  N  mesh  the  Fourier  part  of  FACR  requires  the  order  of 
NMlogpM  clocks  when  it   is   applied  to  the  N  columns   in  the  mesh.      CRED 
requires  the  order  of  NM  clocks.      By  letting  M  be  the  smaller  of  the 
dimensions  the  user   saves   computer  time. 
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k.      USE  OF  ILLIAC   DISK  FOR  NON-CORE  CONTAINED  MESHES 

When  the  problem  "being  run  becomes  too  large  to   contain  in  core, 
the  I/O  time  becomes   an   important   consideration.      First  the   I/O   system 
for  ILLIAC  IV  will  be  discussed  and  then  its   efficiency    for  a  specific 
problem  will  be   examined. 

k.l     ILLIAC   IV  I/O  System 

The  ILLIAC   IV  Disk  is   a  15,600  K  word  memory.      This  memory  is 
divided  into   52  bands.      Each  band  is    divided  into   300  pages,    each  con- 
taining l6  lines   of  PEM  (l  K  words).      Memory  transfers  between  the 
ILLIAC   IV  Disk  and  PEM  are  restricted  to  pages.     A  data  request   can 
read  or  write  up  to  128  consecutive  pages   on  one  band.      If  the  data 
is  not   consecutive  or  if  it   is   spread  over  more  than  one  band  a 
separate  request   is  needed  for  each   string  of  consecutive  pages.      The 
time  required  for  the   disk  to  prepare  to  perform  a  data  request   is 
equivalent  to  the  time  to  transfer  two  pages   of  data  between  the  Disk 
and  PEM.      Thus   if  a  data  request   reads   or  writes   on  page  i   and  band  j, 
the  next   data  request   should  skip  at   least  two  pages  and  start   at  page 
i  +  3  of  band  k,  where  j    need  not   equal  k.      If  the  second  data  request 
wanted  page   i  +  1,    or  i  +  2,   the  request  would  have  to  wait    for  a  revolu- 
tion of  the   disk  before  it   could  be   executed.      The  transfer  rate  between 
PEM  and  Disk  is  133  usee  per  page  and  the  disk  revolves  once  every  kO 
msec.      One  revolution  of  the   disk   is   equivalent  to   61+0,000  ILLIAC  IV  clocks 

^.2     1/0   for  the  Bernard-Rayleigh   convection  problem 

Now  let   us    examine  the  Bernard-Rayleigh  convection  problem  where  the 
following  equations   are  used  to  solve   for  temperature,   T,   pressure,   P, 
and  velocity   components  u  and  w  along  with  the  X  and  Z  axes: 
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£  +  u  »  +  „  f    =  i-  V2!  (50) 

9t  9x  9z  P 

r 

3n  3n   L       3n     _  Ra  9T      2  ,      » 

r-+  ur-  +  w  —     =  —  TT     V  n .  (51) 

9t  9x  9z  P      9x 

r 

n  =  |*  _  |H  =  V2*  (52) 

9X  dZ 

where  t    is   time,    R     is  the  Rayleigh  number,   P     is  the  Prandtl  number,   and 


^   is  the  stream  function  defined  by 


9i1j 


9z    '  9x 


The   finite   difference  schemes   employed  to   solve  these   equations   require 
five  meshes:      T(t),    T{x'l\    r/0,    n^    and  *(x). 


Each  time  step  of  the   convection  process  has  three  main  parts; 

(x+l)  (t+1)  (t+1) 

calculate   n  ,    calculate  T  and  calculate  ¥ 

„<^>   is  calculated  using,  W    .,   ^    jff) ,   jg^   .^W^.   T^ 
is  calculated  using  T^.,  T^,   T^T1'  *j$>Jtl,  l&J^   «^»i^l-  I 

Using  all  of  r\  J ,¥  is   calculated. 


The  process  will  be   divided  into  two  parts.      First  the  prognostic 

ain  ri 
(t+1) 


equations    (50 )    and   (5l)    are   solved  to  obtain  ri  and  T  .      Then 


Pois son's   equation,    (52),    is   solved  for  \F 

The  prognostic   equations   require   five  meshes  while  Poisson's 
equation   requires  two  meshes    if  solved  by   an  iterative  method  and  one 
mesh   if  solved  by   a  direct  method.      Thus  non-core  contained  problems 
can  be  divided     into  two  types.      In  the   first  type,  the  meshes   required 
for  solving  Poisson's   equation   are   core   contained  but    some  of  the  meshes 
required  for  solving  the  prognostic  equations   are  not   contained  in   core. 
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In  the   second  case,   the  meshes   are  so  large  that  none  of  the  meshes    can 
be  core   contained. 

In  [Ogura,  et  al]  the  first  type  of  problem  was  studied  in  which  SOR 
was  used  to  solve  Poisson's  equation.  If  a  direct  method  had  been  used  in 
that   study,   the  maximum  mesh   size   could  be  doubled. 

In  this   study  the   second  type  of  I/O  problem  will  be   examined  where 
MFACR  is   used  to   solve  Poisson's   equation.      For  an  example  we  will  use 
512  by  512  meshes.      P.     .  where  1   <   i   <   6U  and  0   <   j   £   3  is   a  page   of  a 
mesh   containing  all  points   a         where  8.    -   8  <  k    <  8.    and   128  j   <  £  <  128  j  +  ]28. 

If  the  prognostic   equations   are   solved  separately  using  ASK  code  they 
are  found  to  be  about    50%  I/O  bound.      To   decrease  the   I/O  bound  the  Fourier 
analysis  of  MFACR  will  be  performed  along  with  the   calculation  of  the  prog- 
nostic  equations. 

This  will  be   followed  by  CRED,  with  the   final   step  being  Fourier 
synthesis. 

The  data  will  be   stored  in  blocks  22  pages  long.      Block  number  I,   B 
contains   P         where   0   <   j    <   3  of  n        ,    T(      ,    n     ~      ,   T(x~      ,    and  <TT'    for 
time  step   r.      Also   included  in  B     are  the  two  pages   required  for  the  read 
or  write  interpret.    Thirteen  blocks   can  be  placed  on  one  band  with  Ik  pages 
remaining.      The   data  will  be   spread  across  the  11  bands  of  disk  with  22 
pages   separating  each  block  to   leave  room  for  the   information   from  PEM  to 
be  written.      Seven  blocks   of  data  and  six  blocks   for  writing  are   found  on 
bands   0,   2,    U,    6,    and  8.      Six  blocks  of  data  and  seven  blocks   for  writing 
are  found  on  bands   1,    3,    5,    and  7.      Band  9   contains   5  blocks  of  data  and 
seven  blocks   for  writing.      Band  10   contains  the  last  block  for  writing. 
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The  blocks  are  skewed  to  enable  efficient  access  of  rows  and  columns . 
The  first  block  on  band  I  starts  at  page  -21  mod  300. 

The  prognostic  equations  and  the  Fourier  analysis  are  performed  while 
the  disk  revolves  ten  times  and  110  pages  are  passed  on  the  eleventh  revolu- 
tion.  Figure  6  shows  what  data  will  be  printed  on  the  first  kh   pages  of 
each  band. 

Only  one  column  of  pages  can  fit  in  PEM  at  one  time.   Since  CRED 
requires  an  entire  column  before  the  calculations  can  be  performed,  a 
column  is  read  into  PEM,  the  calculations  are  performed  and  the  column  is 
printed  on  disk.   The  data  has  been  skewed  in  such  a  way  as  to  enable  an 
entire  column  to  be  read  or  written  in  one  revolution,  see  Figure  6. 
There  are  four  columns  and  the  calculation  of  each  takes  l/k   of  a  revolu- 
tion.  The  total  I/O  time  for  CRED  is  nine  revolutions. 

The  final  step,  Fourier  Synthesis,  is  performed  in  four  revolutions. 
Figure  6  shows  that  this  can  be  accomplished  on  the  first  kk   pages  by  cal- 
culating rows  5,  31  and  57  on  the  first  revolution,  rows  18  and  kk   on  the 
second  revolution,  rows  12,  38,  6k   on  the  third  revolution  and  rows  25,  51 
on  the  last  revolution.   To  enable  Y^T+  '  to  overwrite  >}AT+1/3J  Row  j  of 
vjAt  J 5)    lg  written  3  data  blocks  before  Row  I  of  ^T+1/3)  during  CRED.* 

The  entire  time  step  required  23  revolutions  and  110  pages  on  the  2l+th 
revolution.   This  takes  932  milliseconds.   The  calculation  could  be  per- 
formed in  333  milliseconds.   Thus  the  process  is  6k%   I/O  bound. 

A  large  portion  of  this  I/O  bound  is  created  by  the  solution  of 
Poisson's  equation.   The  last  two  steps  of  MFACR  required  13  revolutions 
of  the  disk  while  the  calculations  for  those  steps  required  the  amount 

of  time  for  2^  revolut ions . 

*   (t+1/3) 

y        contains  the  harmonics  calculated  by  Fourier  analysis. 

(t+2/3) 
V        contains  the  harmonics  calculated  by  CRED. 
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To  minimize  total  computer  time     a  problem  takes,  a  programmer  must 
adapt   algorithms  to   disk  maps  which  give  minimum  I/O  time.      With  I/O 
being  such  a  dominant    factor  in  the  above  problem  the  use  of  library- 
subroutines  becomes  minimal  because  programs  need  to  be  customized  to 
minimize  I/O. 


Figure   6.      A  Disk  Map  of   512  by    512  Meshes 
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5.      CONCLUSIONS 

Table   13   compares   MFACR   and  FACR  with   SOR   and  ADI    for  Dirichlet's 
boundary    conditions.      The   direct   methods    use  modified  Thomas'    method  to 
perform  CRED   and  Cooley's   methods   to  perform  the  Fourier  part   of  the 
algorithm.      Unless    an   excellent    initial    guess    is    supplied  for  the   itera- 
tive methods,    the   direct   methods    are  much    faster. 

For  maximum  machine   efficiency,    MFACR  requires   64k  by   2   +1 
meshes,    FACR  requires   128k  by  2   +1  meshes,    ADI  requires   64k  by  641 
meshes,    and  SOR  requires    64k  by   j    meshes.        As    seen   in   section   2,    SOR 
can  partition   its  mesh  to  obtain  maximum  machine   efficiency.      If  the  mesh 
for  one  of  the  other  methods   did  not  meet   the   row  restrictions,    the  mesh 
could  not  be  partitioned  to  meet   the  restrictions.      A  possible   solution 
would  be  to    solve   a  number  of  meshes    at   the    same  time. 


(MESH  SIZE 
M+2  by  N+2 

FACR 

MFACR 

SOR  PER 
ITERATION 

ADI  PER 
ITERATION 

33  by  63 

6.51  or 
25300 

5.71  or 
22400 

I  or 
3900 

3.81  or 
14900 

65  by  63 

5.31  or 
49500 

4  .91  or 
39100 

I  or 
7900 

2.51  or 
20100 

65  by  127 

3.71  or 
59500 

4.71  or 
74000 

I  or 
15900 

2.61 
40700 

129  by  127 

3. 81  or 
122000 

4.91  or 
156000 

I  or 
32000 

2.71  or 
87700 

Table  13.    A  Comparison  of  Methods  with  Respect   to  Time* 


Two   numbers   are   supplied.      The  number  times   I   is  the  number  of 
iterations  which   could  be   performed  in   an  equivalent   amount   of  time. 
The  other  number   is   the   number  of  clocks   required  on   ILLIAC   IV.      The 
times   given  for  SOR   in   section  2.3  don't  take   into   account   the  time 
required  to   check   for  convergence.      To   calculate  the  times    for  SOR 
in   Table  13,20%  has  been  added  to  the  times   in  section  2.3  to   account 
for  convergence   checking. 
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Direct  methods  require  only  one  mesh  in  core.   The  mesh  initially  con- 
tains the  interior  source  points  and  the  boundary  conditions.   Throughout 
the  algorithm,  the  mesh  is  used  as  temporary  storage  with  the  values  in 
the  mesh  set  equal  to  the  solution  in  the  final  step.   Iterative  methods 
require  "both  the  source  and  the  solution  mesh  at  every  iteration. 
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APPENDIX  A 
TIMING  METHODOLOGY 

The  procedure  for  estimating  the  amount  of  time  required  by  an 
algorithm  is  explained  below.   Care  has  "been  taken  to  insure  that  the 
method  of  calculating  the  computer  time  does  not  give  one  algorithm  an 
unwarranted  advantage  over  another. 

In  determining  the  amount  of  time  an  algorithm  takes  on  a  computer 
we  must  consider  the  language  in  which  the  algorithm  is  programmed.   In 
comparing  GLYPNIR  to  ASK,  one  finds  that  the  average  GLYPNIR  code  takes 
about  twice  as  long  as  the  ASK  code  for  the  same  problem.   Two  major 
reasons  for  this  difference  are  memory  fetching  and  indexing.   The 
code  generated  by  GLYPNIR  for  indexing  can  be  improved  considerably  by 
rewriting  it  in  ASK.   GLYPNIR  does  fetching  and  storing  of  data  between 
PEM  and  PE  registers  that  can  be  eliminated  by  efficient  use  of  the  PE 
registers  in  ASK. 

GLYPNIR  code  is  much  easier  to  write  than  ASK.   One  must  choose 
between  simplicity  and  speed.   However,  GLYPNIR  permits  the  programmer 
to  write  sections  of  code  in  ASK.   Thus  the  programmer  can  write  code  in 
GLYPNIR  and  then  rewrite  the  statements  which  are  executed  most  often  in 
ASK.   In  estimating  the  time  for  the  different  methods,  we  have  timed 
the  statements  which  are  executed  in  the  inner  loops  of  the  iterative 
method.   Since  these  statements  take  most  of  the  computer  time  they 
should  be  written  in  ASK.   We  use  the  GLYPNIR  statements  as  an  outline 
of  the  method  but  the  time  estimates  reflect  efficient  ASK  code,  not  the 
code  that  GLYPNIR  would  generate.   To  show  the  reader  how  we  arrived  at 
our  time  estimates  we  have  included  a  table  with  assignment  statements, 
operations  which  are  not  overlapped  and  the  clocks  required  for  each  method, 


A-.: 


The  assignment  statements  are  written  in  a  combination  of  GLYPNIR  and  ASK. 
The  parts  of  the  statements  -which  are  not  used  in  GLYPNIR  are  defined  in 
Table  lU.   A  summary  of  the  assumptions  used  in  obtaining  the  time  esti- 
mates is  found  in  Table  15. 


NOTATION 

MEANING 

NOTATION 

MEANING 

$A 

PE  Register  A 

$Ci 

ACAR  i  for  i=0,l,2,3 

$B 

PE  Register  B 

U(i) 

U  indexed  by  ACAR  i 

$R 

PE  Register  R 

u[$x] 

U  indexed  by  register  X 

$S 

PE  Register  S 

D[$X](i) 

both  $X  and  $Ci  indexing  of  U 

$X 

PE  Register  X 

$Di 

ADB  storage  location  i  for  i=0, . 

-,63 

Table  1^'  Assignment  Statement  Notation 

Every  iterative  algorithm  we  consider  in  this  study  will  have  two  parts: 
Phase  1  will  be  used  to  improve  6h   points  simultaneously;   Phase  2  will 
improve  up  to  63  points  simultaneously.   The  machine  efficiency  of  Phase  1 
is  greater  than  the  machine  efficiency  of  Phase  2.   If  one  used  only  Phase  1 
on  certain  meshes  the  extra  time  needed  to  bring  the  data  to  the  PE's  would 
exceed  the  time  saved  by  improving  machine  efficiency.   Thus  the  fastest 
cede  is  obtained  by  a  combination  of  Phase  1  and  Phase  2. 


1)  CU  instructions  are  overlapped  completely  by  a  combination  of  FINST/ 
PE  instructions  and  PEM  fetches. 

2)  Memory  fetching  is  overlapped  as  much  as  possible. 

3)  Seven  PE  clocks  are  used  to  load  (Jst-drare)  a  PE  register  from  (to)  PEM. 
k)     One  PE  clock  issued  to  load  a  PE  register  from  the  CU. 

5)  Transfer  between  PE  registers  require  one  clock. 

6)  Each  GRABONE  will  be  considered  to  be  an  ASK  LOAD  instruction 
(  10  clocks). 


Table  15.   Assumptions  in  Timing 


The  author  feels  these  timing  assumptions  should  give  times  within  20% 
of  the  actual  times  the  algorithms  would  take  on  ILLIAC  IV. 


A- 3 


To  aid  in  comparing  times  of  different  methods  we  will  define  an 
execution  of  Phase  1  to  he  the  improvement  of  6k   points  simultaneously  hy 
Phase  1.   Similarly,  an  execution  of  Phase  2  is  the  improvement  of  k  points 
where  Phase  2  improves  k  points  simultaneously. 


B-l 

APPENDIX  B 

MAJOR  STEPS  OF  FACR  AND  MFACR  FOR  DIRICHLET'S  BOUNDARY  CONDITIONS  ' 


Given  the  meshes    ¥  and    U  with  mesh  points   at   locations    ¥.        and  U 
where  0   <  i  ^  M  and  0  ^  j    ^  N  the  five  point   difference  scheme  approxima- 
tions of  Poisson's   equation  becomes 

Vl.t     -    2"s.t     +    Vl.t  ,      °B.t-l     -    2US,t     +    "s.t+1  T  ,■       ' 

h2  h2  »•*  ' 

x  y 

To  solve  (B.l)  FACR  works  as  follows. 

Odd/even  reduction  (see  Section  3.1 )  calculates  y     for  the  even  columns 
of  ¥.   Equation  (Ul )  from  Section  3.1  can  be  expanded  into  the  following 
eauation  for  individual  mesh  points. 

1 
-(Ui,t-2    +    Ui,t+2}    -:VUi-2,t    +    V2,t} 


y 


* 


*>&  +  *)  «vl!t  +  w  -  f" ^ +  r* +  M u^ =  f"i.*  (B'2) 


h  h 

X  x 


h  h  h 

x  x  y 


where 

*   .        =  ¥  -  -2—  ft  +   V  1    +  ?l   -X_  *  il    vu  4.  u/  (B.2.1) 

*•*  i.t-1        h   2    ( Vl,t        Vl,t}        2K2   +   4     *i,t   +   *l,t-l 

Fourier  analysis    calculates  the   Fourier  harmonics   YS    and  ^*  defined  by 
M-l 

¥  i,t    m^  *  k,t  SIWir  •  (B-3) 

Combining   (B.2)    and   (B.3)   we  obtain 

A,t  ■  I T [s™  *f  [fs  \,t.a  ♦  w  -  j4  <v2>t +  w 

y  x 


B-2 


5 

The  Fourier  harmonics  for  U,  U   are  defined  as  follows: 
M-l 

K. — X 


and  thus 

M-l 


irilk  TTs 


Combining   (B.5),    (B.6)    and   (B.k)   we  obtain 


iS  1      /TTs  ,    TTs 


h2 


=  _±_  (us        +  us       )  -    6  -V  +  -~  +  -—.  |  it 

i,t        h   2    l    i,t+2  i,t-2;  ^T       h   2        h   2    J      i,L 

y  *         x  x  y 

r    M-l_  .  .     r      h  2     M-l  ,_     _»  ,     oU 

M  k=lL  M     1        ii         1=1  £'  £' 


rh2    1 

h  h       /  1= 

x  y 


j   (SIN  ^%^%:   +    +   SIN  ^-^     U*  j)l 


(B.7) 


Using  the   facts  that   Sin(a±b)   =   Sin   a  Cos   b   ±  Sin  b      Cos   a  and 

M-l 

r        _  .         7r£k    _  .         TTJk  N     -  .      .  -,         o  iv/r    n         -u 

2   Sin  —  Sin  —  =  g  6i  £  for  £>1  =  lj  2>  •"»  M_1  where 
k=l  ' 


f  0  if  i  f 

i,  1^1   if  i  =  J 

system 


£ 
5i  ^   1  if  i    =   p     ve  can     decouple    (B.T)    into   a  tridiagonal  linear 


h  2  /h 


fi,t.=  ^^,t+2+ui,t-2)-  K^sf   .b^^JcobS 
y  '      x  \   x        x 

h   2 

Ii  h   d  h   2  I 

x  x  y 

CRED  solves    (B.8)    for  US. 
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g 

Fourier  synthesis  calculates  U  from  U  using  (B.6). 


This  gives  us  the  values  for  U  on  the  even  columns. 

Using  the  values  of  U  on  the  even  columns  we  can  solve  (B.l)  for 
the  values  of  U  on  the  odd  columns  using  a  method  which  solves  tridiagonal 
linear  systems. 

In  summary  FACR  performs  the  following  algorithm: 

1.  Given  ¥  compute  ¥  for  the  even  columns  using  (B.2.1)  and  overwrite 
y  with  y  on  the  even  columns. 

2.  Using  4*  compute  ¥  according  to  (B.3)  and  overwrite  y1  with  Y  . 

O  C  C  C! 

3.  Using  y  solve  (B.8)  for  IT  and  overwrite  ^  with  U  . 

h.      Using  u  compute  U  according  to  (B.6)  on  the  even  columns  and  over- 

write  U  with  U  on  the  even  columns. 
5.   Using  the  values  of  U  on  the  even  columns  solve  (B.l)  for  the  values 

of  U  on  the  odd  columns  and  overwrite  V  with  U  on  the  odd  columns. 

MFACR  is  similar  to  FACR  but  the  odd/even  reduction  step  is  omitted. 
MFACR  begins  with  Fourier  analysis  which  calculates  the  Fourier  harmonics 
y  for  f  using 

o  M-l 

y''i,t  ""  M  ,  *,   \,t  "*"■"  M 

k=l 

From  B.  9    ve  obtain 

M-l  /U  -  2U  +  U 

i,t       M  v£n  M     V  h   2 


SINlki  (B.9) 


k=l  h 

x 


Us,t-1  -    2Us,t   +Us,t+l\  (B.10) 


h2 

y 


B-U 


Using  the   same  methods   employed  to  obtain    (B.8)    from   (B.T)we  obtain 
(B.H)    from    (B.10). 

s  1         /    s  s  v     ,     /    2      „_    TTi  2  2   \    s  / 

*i,t  "  ^2    (ui,t+i  +  "i.t-i'  +  (^T  cos  -m  "  —  "  -zH,t      f8-11' 
y  y  y         x/ 


CRED  solves    (B.ll)    for   if      . 

i  »t 

Finally    (B.l)    is   used  to  calculate    U  for  the   entire  mesh. 
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